<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Brent Ozar PLF &#187; datamining | Brent Ozar PLF</title>
	<atom:link href="http://www.brentozar.com/archive/tag/datamining/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.brentozar.com</link>
	<description>Your technology pain-relief experts.</description>
	<lastBuildDate>Wed, 08 Feb 2012 14:15:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>How to Import the StackOverflow XML into SQL Server</title>
		<link>http://www.brentozar.com/archive/2009/06/how-to-import-the-stackoverflow-xml-into-sql-server/</link>
		<comments>http://www.brentozar.com/archive/2009/06/how-to-import-the-stackoverflow-xml-into-sql-server/#comments</comments>
		<pubDate>Mon, 08 Jun 2009 12:11:20 +0000</pubDate>
		<dc:creator>Brent Ozar</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[SQLServerPedia Syndication]]></category>
		<category><![CDATA[datamining]]></category>
		<category><![CDATA[stackoverflow]]></category>
		<category><![CDATA[tsql]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=4131</guid>
		<description><![CDATA[Want to play around with the StackOverflow database export?  Here&#8217;s how to import the XML files into SQL Server, and some notes about the tables and data schema. Script to Import StackOverflow XML to SQL Server This T-SQL script will create six stored procedures: usp_ETL_Load_Badges usp_ETL_Load_Comments usp_ETL_Load_Posts usp_ETL_Load_Users usp_ETL_Load_Votes usp_ETL_Load_PostsTags (which isn&#8217;t one of the...<p>...<br /><i>Upcoming free webcasts: <a href="https://brentozarevents.webex.com/brentozarevents/onstage/g.php?t=a&d=663314175">SQL and SSDs: A Valentine's Day Love Story</a> and <a href="https://brentozarevents.webex.com/brentozarevents/onstage/g.php?t=a&d=664876357">Anatomy of the SQL Server Log File</a></i>.</p>
]]></description>
			<content:encoded><![CDATA[<p>Want to play around with the <a href="http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/">StackOverflow database export</a>?  Here&#8217;s how to import the XML files into SQL Server, and some notes about the tables and data schema.</p>
<h3>Script to Import StackOverflow XML to SQL Server</h3>
<p><a href="http://cached.brentozar.com/Import_StackOverflow_XML.zip">This T-SQL script</a> will create six stored procedures:</p>
<ul>
<li>usp_ETL_Load_Badges</li>
<li>usp_ETL_Load_Comments</li>
<li>usp_ETL_Load_Posts</li>
<li>usp_ETL_Load_Users</li>
<li>usp_ETL_Load_Votes</li>
<li>usp_ETL_Load_PostsTags (which isn&#8217;t one of the StackOverflow tables &#8211; more on that in a minute)</li>
</ul>
<p>The XML import code is from an <a href="http://searchsqlserver.techtarget.com/tip/0,289483,sid87_gci1347994_mem1,00.html">excellent XML tutorial by Denny Cherry</a>.  The scripts create a table (named Badges, Comments, Posts, Users, Votes) for each XML file.  The schema matches the XML file with one exception &#8211; I added an identity field to the Badges table.  The rest already had Id fields.  The tables don&#8217;t have any indexes to speed querying. I would highly recommend that you not change the schema of any of these tables, because I&#8217;ll be giving out more scripts over the coming days and weeks that rely on the base tables.  If you want to add more data, add additional tables.  Plus this will keep your importing clean anyway &#8211; you can dump and reload the StackOverflow data repeatedly as long as you keep that data separate.</p>
<p>After importing, the database is about 2GB of data.  Be aware that depending on your database&#8217;s recovery model and how you run these stored procs, your log file may be 2GB as well. None of the sentences in this paragraph blend together well, which bothers me but not quite enough to stop publishing the blog entry. Anyway, on we go.</p>
<p>If the table already exists when the stored proc runs, the table contents are deleted using the TRUNCATE TABLE command, which requires hefty permissions (at least ALTER on the table).  If you don&#8217;t have admin rights on the box, substitute DELETE for the five TRUNCATE TABLE commands.  Using DELETE will take significantly longer to run and will write more to the log, because every deleted row is logged individually.  For reference, with TRUNCATE TABLE, the stored procs take around 10 minutes on my faster machines, and around half an hour on my slower virtual machines.</p>
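<p>As a rough sketch of that substitution (this is not the exact text inside the stored procs, and I&#8217;m assuming the five tables live in the dbo schema), the swap would look something like this:</p>

<div class="wp_syntax"><div class="code"><pre class="tsql" style="font-family:monospace;">-- Instead of: TRUNCATE TABLE dbo.Badges (and so on for each table)
-- DELETE only needs DELETE permission on the table, but it logs every row.
DELETE FROM dbo.Badges;
DELETE FROM dbo.Comments;
DELETE FROM dbo.Posts;
DELETE FROM dbo.Users;
DELETE FROM dbo.Votes;</pre></div></div>

<p>One side effect to keep in mind: DELETE doesn&#8217;t reset an identity seed the way TRUNCATE TABLE does, so an identity column like Badges.Id will keep counting up across reloads.</p>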
<p>These stored procs only work for the new database dump released on Monday morning, not the one released last week.  If you get invalid XML errors while importing, you&#8217;ve got the older database dump.  Go get the fresh hotness.</p>
<p>Now for some schema notes, and I&#8217;m going to go out of alphabetical order because everything links back to the Users table.  I&#8217;m only going to cover the fields that aren&#8217;t immediately obvious:</p>
<h3>Users Table</h3>
<ul>
<li>Id &#8211; primary key, identity field from the original StackOverflow database.  Id 1 is &#8220;Community&#8221;, which is a special user that denotes community ownership, like wiki questions and answers.</li>
<li>LastAccessDate &#8211; this is useful because it tells you when the data export was last updated.  If you&#8217;re doing queries for things like the last 30 days, check the most recent date here.</li>
<li>Age &#8211; the user enters this manually, so it&#8217;s not terribly reliable <a href="http://www.brentozar.com/archive/2009/06/stackoverflow-data-mining-cleansing-the-data/">as I discovered earlier</a>.</li>
<li>AboutMe &#8211; I&#8217;m using an nvarchar(max) field here, but you can go with a shorter field like nvarchar(2000).</li>
<li>UpVotes and DownVotes &#8211; the number of votes this user has cast.</li>
</ul>
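<p>Here&#8217;s a minimal sketch of that LastAccessDate trick.  It assumes LastAccessDate was imported as a datetime column and that the table is dbo.Users; the variable name is just for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="tsql" style="font-family:monospace;">-- Treat the most recent LastAccessDate as the "as of" date for the export.
DECLARE @ExportDate datetime;

SELECT @ExportDate = MAX(LastAccessDate)
FROM   dbo.Users;

-- Count users seen in the 30 days leading up to the export date,
-- rather than the 30 days leading up to today.
SELECT COUNT(*) AS UsersActiveLast30Days
FROM   dbo.Users
WHERE  LastAccessDate &gt;= DATEADD(DAY, -30, @ExportDate);</pre></div></div>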
<h3>Posts Table</h3>
<p>In StackOverflow, questions and answers are both considered posts.  If a record has a null ParentId field, then it&#8217;s a question.  Otherwise, it&#8217;s an answer, and to find the matching question, join the ParentId field up to Posts.Id.</p>
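<p>The self-join described above can be sketched like this.  The aliases and the TOP 100 limit are my own choices for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="tsql" style="font-family:monospace;">-- Pair each answer (a) with its question (q) via ParentId.
SELECT TOP 100
       q.Id    AS QuestionId
       ,q.Title
       ,a.Id    AS AnswerId
       ,a.Score AS AnswerScore
FROM   dbo.Posts a
       INNER JOIN dbo.Posts q
         ON a.ParentId = q.Id
WHERE  q.ParentId IS NULL      -- the parent is a question
ORDER BY q.Id, a.Score DESC;</pre></div></div>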
<ul>
<li>Id &#8211; primary key, identity field from the original StackOverflow database.</li>
<li>Title &#8211; the title of the question.  Answer titles will be null.</li>
<li>OwnerUserId &#8211; joins back to Users.Id.  If OwnerUserId = 1, that&#8217;s the community user, meaning it&#8217;s a wiki question or answer.</li>
<li>AcceptedAnswerId &#8211; for questions, this points to the Posts.Id of the officially accepted answer.  This isn&#8217;t necessarily the highest-voted answer, but the one the questioner accepted.</li>
<li>Tags &#8211; okay, time to blow out of the bullet points for a second.</li>
</ul>
<p>StackOverflow limits you to five tags per question (answers aren&#8217;t tagged), and all five are stored in this field.  For example, for question <a href="http://stackoverflow.com/questions/305223/jon-skeet-facts">305223</a>, the Tags field is &#8220;&lt;offtopic&gt;&lt;fun&gt;&lt;not-programming-related&gt;&lt;jon-skeet&gt;&#8221;.  It&#8217;s up to you to normalize these.  If you&#8217;d like to normalize them out into a child table, check out the usp_ETL_Load_PostsTags stored proc, which creates a PostsTags table with PostId and Tag fields.  Each Posts record (questions only) will then have several child records in PostsTags.</p>
<p>Next, check the contents of the Tag field carefully.  StackOverflow allows periods in the tag, like the <a href="http://stackoverflow.com/questions/tagged/.net">.NET tag</a> and <a href="http://stackoverflow.com/questions/tagged/asp.net">ASP.NET tag</a>.  However, in the database, these are stored as &#8220;aspûnet&#8221;.  Just something to be aware of.</p>
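<p>Once the PostsTags table is loaded, a couple of quick queries help you get a feel for the tags.  This is only a sketch, assuming PostsTags is in the dbo schema with the PostId and Tag fields described above:</p>

<div class="wp_syntax"><div class="code"><pre class="tsql" style="font-family:monospace;">-- The ten most-used tags.
SELECT TOP 10
       pt.Tag
       ,COUNT(*) AS Questions
FROM   dbo.PostsTags pt
GROUP BY pt.Tag
ORDER BY COUNT(*) DESC;

-- Tags where the period was stored as a different character.
SELECT DISTINCT pt.Tag
FROM   dbo.PostsTags pt
WHERE  pt.Tag LIKE N'%û%';</pre></div></div>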
<h3>Comments Table</h3>
<ul>
<li>Id &#8211; primary key, identity field from the original StackOverflow database.</li>
<li>PostId &#8211; the post parent for this comment.  Joins to the Posts.Id field.</li>
<li>UserId &#8211; who left the comment.  Joins to the Users.Id field.</li>
</ul>
<h3>Badges Table</h3>
<ul>
<li>Id &#8211; an identity field for a primary key.  This number is meaningless &#8211; I just added it for some referential integrity.</li>
<li>UserId &#8211; joins back to Users.Id to show whose badge it is.</li>
<li>Name &#8211; the name of the Badge, like Teacher or Nice Answer.</li>
<li>CreationDate &#8211; when the user achieved the badge.</li>
</ul>
<h3>Votes Table</h3>
<p>This stores the votes cast on posts, but the key field is VoteTypeId.  The VoteType table wasn&#8217;t included in the export, so this table isn&#8217;t too useful yet, but if the guys give me the OK I&#8217;ll post the contents of that table here.  The Votes table doesn&#8217;t include <em>who</em> cast the votes, and I&#8217;ve got my hands full analyzing the other tables anyway, so I haven&#8217;t been interested in the VoteTypes yet.</p>
<p>All of the Id fields except for Badges.Id are from StackOverflow&#8217;s original database.  In theory, these numbers will not change, which means if you build your own child table structures like UserBaconPreferences, and you join via Users.Id, you should be able to blow away and reload the Users table with every new StackOverflow database dump.  That&#8217;s the theory, but in reality, you shouldn&#8217;t rely on anybody else&#8217;s ID fields, because there&#8217;s no reason to believe these won&#8217;t completely change down the road.  Who knows &#8211; Jeff might <a href="http://www.codinghorror.com/blog/archives/000817.html">switch over to GUIDs as primary keys</a>.</p>
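<p>To make the child table idea concrete, here&#8217;s a hypothetical sketch.  The UserBaconPreferences columns are made up for illustration; only Users.Id and Users.Reputation come from the export:</p>

<div class="wp_syntax"><div class="code"><pre class="tsql" style="font-family:monospace;">-- Hypothetical child table, kept separate from the imported tables.
-- No foreign key to dbo.Users on purpose: TRUNCATE TABLE fails on a
-- table that is referenced by a foreign key constraint.
CREATE TABLE dbo.UserBaconPreferences
  (
     UserId        INT NOT NULL PRIMARY KEY  -- matches Users.Id
     ,PrefersCrispy BIT NOT NULL
  );

-- After each reload of Users, the join still lines up
-- as long as StackOverflow keeps the same Id values.
SELECT u.Id
       ,u.Reputation
       ,b.PrefersCrispy
FROM   dbo.Users u
       INNER JOIN dbo.UserBaconPreferences b
         ON b.UserId = u.Id;</pre></div></div>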
<h3>Sample Questions Query</h3>
<p>Once you&#8217;ve got it all together, you can do some fun stuff.  Let&#8217;s look at some overall statistics about <strong>questions</strong> (not answers):</p>

<div class="wp_syntax"><div class="code"><pre class="tsql" style="font-family:monospace;"><span style="color: #0000FF;">SELECT</span> <span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">COUNT</span><span style="color: #808080;">&#40;</span><span style="color: #0000FF;">DISTINCT</span> p.<span style="color: #202020;">ID</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>           <span style="color: #0000FF;">AS</span> Questions
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>p.<span style="color: #202020;">Score</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>           <span style="color: #0000FF;">AS</span> AvgScore
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>p.<span style="color: #202020;">ViewCount</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>       <span style="color: #0000FF;">AS</span> AvgViewCount
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">COUNT</span><span style="color: #808080;">&#40;</span><span style="color: #0000FF;">DISTINCT</span> p.<span style="color: #202020;">OwnerUserId</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span> <span style="color: #0000FF;">AS</span> DistinctQuestioners
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>p.<span style="color: #202020;">AnswerCount</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>     <span style="color: #0000FF;">AS</span> AvgAnswerCount
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>p.<span style="color: #202020;">CommentCount</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>    <span style="color: #0000FF;">AS</span> AvgCommentCount
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>p.<span style="color: #202020;">FavoriteCount</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>   <span style="color: #0000FF;">AS</span> AvgFavoriteCount
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">COUNT</span><span style="color: #808080;">&#40;</span>ClosedDate<span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>             <span style="color: #0000FF;">AS</span> ClosedQuestions
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>u.<span style="color: #202020;">Reputation</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>      <span style="color: #0000FF;">AS</span> AvgQuestionerReputation
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>u.<span style="color: #202020;">Age</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>             <span style="color: #0000FF;">AS</span> AvgQuestionerAge
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>u.<span style="color: #202020;">UpVotes</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>         <span style="color: #0000FF;">AS</span> AvgQuestionerUpVotes
       ,<span style="color: #0000FF;">COALESCE</span><span style="color: #808080;">&#40;</span><span style="color: #FF00FF;">AVG</span><span style="color: #808080;">&#40;</span>u.<span style="color: #202020;">DownVotes</span> <span style="color: #808080;">*</span> <span style="color: #000;">1.00</span><span style="color: #808080;">&#41;</span>,<span style="color: #000;">0</span><span style="color: #808080;">&#41;</span>       <span style="color: #0000FF;">AS</span> AvgQuestionerDownVotes
<span style="color: #0000FF;">FROM</span>   dbo.<span style="color: #202020;">Posts</span> p
       <span style="color: #0000FF;">INNER</span> <span style="color: #808080;">JOIN</span> dbo.<span style="color: #202020;">Users</span> u
         <span style="color: #0000FF;">ON</span> p.<span style="color: #202020;">OwnerUserId</span> <span style="color: #808080;">=</span> u.<span style="color: #202020;">Id</span>
<span style="color: #0000FF;">WHERE</span> p.<span style="color: #202020;">Tags</span> <span style="color: #0000FF;">IS</span> <span style="color: #808080;">NOT</span> <span style="color: #808080;">NULL</span></pre></div></div>

<p>And some of the results are:</p>
<ul>
<li>Questions &#8211; 176,137</li>
<li>Average Score &#8211; 1.89</li>
<li>Average View Count &#8211; 311</li>
<li>Distinct Questioners &#8211; 39,795 (meaning anyone who has asked at least one question has asked 4.4 questions on average &#8211; there may be some odd stuff in here around anonymous questions though, I haven&#8217;t looked at that yet)</li>
<li>Average Answer Count &#8211; 4</li>
<li>Average Comment Count &#8211; 2.3</li>
<li>Closed Questions &#8211; 3,656 (or 2% of all questions)</li>
<li>Average Questioner Reputation &#8211; 1,506</li>
<li>Average Questioner Age &#8211; 30 (but remember, that&#8217;s unreliable)</li>
</ul>
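<p>The two derived numbers in that list (questions per questioner and the closed percentage) can also be computed directly.  This is a sketch under the same assumption as the query above, that a non-null Tags field marks a question; it skips the join to Users, so the counts may differ slightly from the ones listed:</p>

<div class="wp_syntax"><div class="code"><pre class="tsql" style="font-family:monospace;">SELECT COUNT(*) * 1.00
         / NULLIF(COUNT(DISTINCT p.OwnerUserId), 0) AS QuestionsPerQuestioner
       ,COUNT(p.ClosedDate) * 100.0
         / NULLIF(COUNT(*), 0)                      AS PctClosed
FROM   dbo.Posts p
WHERE  p.Tags IS NOT NULL;  -- questions only</pre></div></div>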
<p>I&#8217;m just getting started playing with it, and I&#8217;ll have a fun new StackOverflow statistics toy available for everybody to play with in a couple of days.  In the meantime, you can <a href="http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/">download the StackOverflow database dump via BitTorrent</a> and <a href="http://cached.brentozar.com/Import_StackOverflow_XML.zip">download my ETL stored procs</a>.</p>
<h3>Update: <a href="http://sqlserverpedia.com/wiki/Data_Mining_the_StackOverflow_Database">Sample StackOverflow Queries in the SQLServerPedia Wiki</a></h3>
<p><a href="http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/#comment-24381">Jon Skeet had an excellent idea</a>: we need a wiki to store interesting queries.  Wouldn&#8217;t you know, I happen to run one!  I added a <a href="http://sqlserverpedia.com/wiki/Data_Mining_the_StackOverflow_Database">section in SQLServerPedia for sample StackOverflow database queries</a>.</p>
<div class="wp-about-author-containter-top" style="background-color:#FFEAA8;"><div class="wp-about-author-pic"><img alt='' src='http://1.gravatar.com/avatar/77f776c2eaf0cc691e8a0880bb8a191f?s=100&amp;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D100&amp;r=R' class='avatar avatar-100 photo' height='100' width='100' /></div><div class="wp-about-author-text"><h3><a href='http://www.brentozar.com/archive/author/BrentO/' title='Brent Ozar'>Brent Ozar</a></h3><p>Brent specializes in performance tuning for SQL Server, VMware, and storage.  He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP.  He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person.  <a href="http://www.brentozar.com/consultants/brent-ozar/">Read more and contact Brent</a>.</p><p><a href='http://www.brentozar.com' title='Brent Ozar'>Website</a> - <a href='http://twitter.com/brento' title='Brent Ozar on Twitter'>Twitter</a> - <a href='http://www.facebook.com/brentozar' title='Brent Ozar on Facebook'>Facebook</a> - <a href='http://www.brentozar.com/archive/author/BrentO/' title='More posts by Brent Ozar'>More Posts</a> </p></div></div>]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2009/06/how-to-import-the-stackoverflow-xml-into-sql-server/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>StackOverflow Data Mining: Cleansing the Data</title>
		<link>http://www.brentozar.com/archive/2009/06/stackoverflow-data-mining-cleansing-the-data/</link>
		<comments>http://www.brentozar.com/archive/2009/06/stackoverflow-data-mining-cleansing-the-data/#comments</comments>
		<pubDate>Sat, 06 Jun 2009 19:56:41 +0000</pubDate>
		<dc:creator>Brent Ozar</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[SQLServerPedia Syndication]]></category>
		<category><![CDATA[datamining]]></category>
		<category><![CDATA[stackoverflow]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=4119</guid>
		<description><![CDATA[The first stage of mining is a dirty, ugly business. Miners don&#8217;t emerge from tunnels bearing armfuls of shiny diamonds.  They come out with filthy, misshapen rocks that might be something valuable &#8211; but might be worthless junk.  There&#8217;s no way to tell what you&#8217;ve really got until you&#8217;ve spent some time analyzing and polishing....
]]></description>
			<content:encoded><![CDATA[<p>The first stage of mining is a dirty, ugly business.</p>
<div id="attachment_4122" class="wp-caption alignright" style="width: 310px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; float: right;"><a href="http://www.flickr.com/photos/vivid_pixel/2657790692/"><img class="size-medium wp-image-4122" title="mine" src="http://cached.brentozar.com/wp-content/uploads/2009/06/mine-300x284.jpg" alt="My Datacenter" width="300" height="284" /></a><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">My Datacenter</p></div>
<p>Miners don&#8217;t emerge from tunnels bearing armfuls of shiny diamonds.  They come out with filthy, misshapen rocks that might be something valuable &#8211; but might be worthless junk.  There&#8217;s no way to tell what you&#8217;ve really got until you&#8217;ve spent some time analyzing and polishing.</p>
<p>Take one of my early findings in the <a href="http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/">StackOverflow database export</a>: the average age of StackOverflow users is 31, but in May, the average age of the person asking a <a href="http://stackoverflow.com/questions/tagged/hook">question tagged &#8220;hook&#8221;</a> was 59.  That&#8217;s a serious deviation.  At the other end of the scale, people asking <a href="http://stackoverflow.com/questions/tagged/ec2">questions tagged &#8220;ec2&#8221;</a> had an average age of, uh, zero.  While there is the possibility that <a href="http://twitter.com/rockhardawesome">RockhardAwesome</a> is hard at work building virtual machines in Amazon EC2, I&#8217;m voting that one down.</p>
<p>That&#8217;s what I get for jumping into mining without cleaning off my rocks first.</p>
<p>Out of the 86,110 users in the database export, only 22,747 provided their age &#8211; and the key phrase is &#8220;provided their age.&#8221;  You can&#8217;t trust any data you get from human beings, especially these particular folks:</p>
<p><a href="http://stackoverflow.com/users/522">Ed</a> &#8211; Age 256<br />
<a href="http://stackoverflow.com/users/103">svec</a> &#8211; Age 109<br />
<a href="http://stackoverflow.com/users/159">deuseldorf</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/90">Coding the Wheel</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/730">Keng</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/987">Will Dean</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/1065">kokos</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/1223">ColinYounger</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/1242">Lars Truijens</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/13293">dydx</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/14173">Confused Computer Guy</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/14456">Ian Kelling</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/14569">davr</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/15162">Smirking Liberal</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/16005">Sam Meldrum</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/17007">DrStalker</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/17746">Frans</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/17826">Mark Bessey</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/18747">Tony Andrews</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/20161">Pat</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/21677">J-P</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/24039">Simon</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/2031">danb</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/2112">dhislop</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/2590">Matt Rogish</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/3166">Josh</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/3790">pozdziemny</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/3810">chinna</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/4668">Alan Storm</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/4681">Joseph Ducreux</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/4737">jamesh</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/5505">toobstar</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/5964">markd</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/6682">Atif Aziz</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/9360">Peter Boughton</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/10278">que que</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/10492">DJ</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/10631">Cliff</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/11339">gaoshan88</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/34831">King Avitus</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/35362">alden</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/37843">Alan</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/38264">yx</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/38613">ElephantMoss</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/39057">Loki</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/44481">Tautologistics</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/47522">Alkini</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/48906">h_power11</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/49153">Click Upvote</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/50548">Salty</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/51071">Sean James</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/53363">kenneth</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/56843">ysangkok</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/57461">Pod</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/61992">Edward</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/62539">MedicineMan</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/62596">Heikki Toivonen</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/62858">Stuart</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/62921">ForceMagic</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/63994">Jane Sales</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/67510">hanesjw</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/67816">xx</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/28421">Silfheed</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/29838">noob source</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/73439">Snickers</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/76114">davefb</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/79749">markti</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/82216">sampablokuper</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/83089">afitzpatrick</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/85688">mishac</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/87280">Computer Security</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/87716">oofoe</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/88404">Tyler Egeto</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/88806">jeffa00</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/91410">Nikola Jevtic</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/95664">Dave</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/103850">monkeysword</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/105760">wowus</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/106801">sgargan</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/108671">saidireddy</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/110818">Bobby Fever</a> &#8211; Age 89<br />
<a href="http://stackoverflow.com/users/103574">Zaakk</a> &#8211; Age 88<br />
<a href="http://stackoverflow.com/users/73521">Gary</a> &#8211; Age 88<br />
<a href="http://stackoverflow.com/users/32483">rlb.usa</a> &#8211; Age 88<br />
<a href="http://stackoverflow.com/users/72022">tan</a> &#8211; Age 88<br />
<a href="http://stackoverflow.com/users/63295">Kieranmaine</a> &#8211; Age 88<br />
<a href="http://stackoverflow.com/users/58135">Ainab</a> &#8211; Age 88<br />
<a href="http://stackoverflow.com/users/13002">Sleep Deprivation Ninja</a> &#8211; Age 88<br />
<a href="http://stackoverflow.com/users/11438">joelhardi</a> &#8211; Age 87<br />
<a href="http://stackoverflow.com/users/52389">Simon H</a> &#8211; Age 86<br />
<a href="http://stackoverflow.com/users/99880">Nick Hildebrant</a> &#8211; Age 86<br />
<a href="http://stackoverflow.com/users/1464">alanl</a> &#8211; Age 84<br />
<a href="http://stackoverflow.com/users/93897">Dustin</a> &#8211; Age 81<br />
<a href="http://stackoverflow.com/users/74815">jeffamaphone</a> &#8211; Age 80<br />
<a href="http://stackoverflow.com/users/98038">molf</a> &#8211; Age 80</p>
<p>I applaud these folks for their civil disobedience, and curse them for same.  There&#8217;s an interesting underlying correlation: people who ask questions about hooks seem to be more likely to lie about their age.  I&#8217;ll leave that as an exercise for the reader.</p>
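<p>For anyone who wants to clean off their own rocks, here&#8217;s a rough sketch of how the suspicious ages could be pulled out and excluded.  It assumes the Users table sits in the dbo schema with an integer Age field, and the cutoffs (80 and up, 13 to 79) are arbitrary numbers I picked for illustration, not anything official:</p>

<div class="wp_syntax"><div class="code"><pre class="tsql" style="font-family:monospace;">-- Who claims to be 80 or older, or zero or younger?
SELECT u.Id
       ,u.Age
FROM   dbo.Users u
WHERE  u.Age &gt;= 80
        OR u.Age &lt;= 0
ORDER BY u.Age DESC;

-- Average age after throwing out the implausible values.
SELECT COUNT(*)            AS UsersWithPlausibleAge
       ,AVG(u.Age * 1.00)  AS AvgPlausibleAge
FROM   dbo.Users u
WHERE  u.Age BETWEEN 13 AND 79;</pre></div></div>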
<p>On the bright side, I&#8217;ve found some other interesting bits of data, although these are still very much rocks that haven&#8217;t been cleansed yet:</p>
<ul>
<li><a href="http://stackoverflow.com/questions/tagged/beginner">Questions tagged beginner</a> get significantly higher upvotes than other questions (avg 391, sitewide avg 120), which might indicate that if you want an upvoted question, you should write one for beginners.</li>
<li>Questions tagged <a href="http://stackoverflow.com/questions/tagged/routing">routing</a>, <a href="http://stackoverflow.com/questions/tagged/resources">resources</a>, <a href="http://stackoverflow.com/questions/tagged/video">video</a>, <a href="http://stackoverflow.com/questions/tagged/programming">programming</a> or <a href="http://stackoverflow.com/questions/tagged/google">google</a> are favorited more than twice as often as the average.</li>
<li>Questions tagged <a href="http://stackoverflow.com/questions/tagged/svn">svn</a> are asked by people who do more downvoting than other users (avg 18, sitewide avg 10).  Conversely, questions tagged <a href="http://stackoverflow.com/questions/tagged/vim">vim</a> or <a href="http://stackoverflow.com/questions/tagged/interop">interop</a> are asked by people who do more upvoting (avg 324 and 303, sitewide avg 119).</li>
<li>Questions tagged <a href="http://stackoverflow.com/questions/tagged/homework">homework</a> are asked by younger users (avg age 24, sitewide question avg 29).  Makes sense.</li>
</ul>
<p>I&#8217;ll dig more into this tomorrow, but now I&#8217;m off to see <a href="http://www.brentozar.com/what-i-do/my-family/my-dad/">my dad</a> to celebrate his 60th birthday.  Hmmm &#8211; you know, come to think of it, I haven&#8217;t actually seen his driver&#8217;s license&#8230;</p>
<div class="wp-about-author-containter-top" style="background-color:#FFEAA8;"><div class="wp-about-author-pic"><img alt='' src='http://1.gravatar.com/avatar/77f776c2eaf0cc691e8a0880bb8a191f?s=100&amp;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D100&amp;r=R' class='avatar avatar-100 photo' height='100' width='100' /></div><div class="wp-about-author-text"><h3><a href='http://www.brentozar.com/archive/author/BrentO/' title='Brent Ozar'>Brent Ozar</a></h3><p>Brent specializes in performance tuning for SQL Server, VMware, and storage.  He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP.  He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person.  <a href="http://www.brentozar.com/consultants/brent-ozar/">Read more and contact Brent</a>.</p><p><a href='http://www.brentozar.com' title='Brent Ozar'>Website</a> - <a href='http://twitter.com/brento' title='Brent Ozar on Twitter'>Twitter</a> - <a href='http://www.facebook.com/brentozar' title='Brent Ozar on Facebook'>Facebook</a> - <a href='http://www.brentozar.com/archive/author/BrentO/' title='More posts by Brent Ozar'>More Posts</a> </p></div></div>]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2009/06/stackoverflow-data-mining-cleansing-the-data/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Data Mining the StackOverflow Database</title>
		<link>http://www.brentozar.com/archive/2009/06/data-mining-the-stackoverflow-database/</link>
		<comments>http://www.brentozar.com/archive/2009/06/data-mining-the-stackoverflow-database/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 14:57:09 +0000</pubDate>
		<dc:creator>Brent Ozar</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[datamining]]></category>
		<category><![CDATA[ssas]]></category>
		<category><![CDATA[stackoverflow]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=4071</guid>
		<description><![CDATA[StackOverflow released a public dump of their database this morning. Jeff Atwood and the guys believe that if you, the community, are putting the work into this huge body of knowledge, then you should be able to have rights to use it. This is a great dataset to show off one of my favorite toys...
]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/">StackOverflow released a public dump of their database</a> this morning.  Jeff Atwood and the guys believe that if you, the community, are putting the work into this huge body of knowledge, then you should be able to have rights to use it.</p>
<p>This is a great dataset to show off one of my favorite toys from the <a href="http://www.sqlserverdatamining.com/ssdm/">Microsoft SQL Server Data Mining team</a>.  In this half-hour video, Tom LaRock and I will walk you through data mining the StackOverflow user list to find out more about the users and see what makes the rockstar high-reputation users different from the worker bees like me.</p>
<p><object type="application/x-shockwave-flash" data="http://vimeo.com/moogaloop.swf" width="480" height="256"><param name="allowscriptaccess" value="always"/><param name="allowfullscreen" value="true"/><param name="movie" value="http://vimeo.com/moogaloop.swf"/><param name="flashvars" value="clip_id=10729965&amp;color=00adef&amp;fullscreen=1&amp;server=vimeo.com&amp;show_byline=1&amp;show_portrait=1&amp;show_title=1"/></object></p>
<p>If this looks interesting to you, here&#8217;s what else I&#8217;ve been doing with the StackOverflow data:</p>
<ul>
<li><a href="http://sqlserverpedia.com/wiki/Understanding_the_StackOverflow_Database_Schema">StackOverflow Database Schema article</a> at SQLServerPedia, the wiki I manage</li>
<li><a href="http://sqlserverpedia.com/wiki/How_to_Import_the_StackOverflow_XML_into_SQL_Server">How to import the StackOverflow XML files into Microsoft SQL Server</a></li>
</ul>
<p>Now, back to what I did in the video &#8211; let&#8217;s talk about the tools I used.</p>
<h3>Microsoft&#8217;s Free Data Mining Tools</h3>
<p>For today&#8217;s demo, I&#8217;m using SQL Server Analysis Services installed on my desktop.  Relax &#8211; it&#8217;s really easy.  Literally just install SQL Server 2005 or 2008 Developer Edition, check the box for Analysis Services, and use the defaults.  You don&#8217;t have to know what you&#8217;re doing in order to get it up and running, and it just runs in the background as a service.  After you&#8217;re done playing around, you can stop the service and set it to manual to prevent it from sapping your system resources.  Go into Control Panel, Administrative Tools, Services, double-click on the SQL Server Analysis Services service, and change the startup type to Manual.</p>
<p>Depending on your version of SQL Server and Excel, you&#8217;ll need one of these free plugins from Microsoft:</p>
<ul>
<li><a href="http://www.microsoft.com/DOWNLOADS/details.aspx?familyid=7C76E8DF-8674-4C3B-A99B-55B17F3C4C51&amp;displaylang=en">Microsoft SQL Server 2005 Data Mining Add-Ins for Office 2007</a></li>
<li><a href="http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx">Microsoft SQL Server 2008 Data Mining Add-Ins for Office 2007</a></li>
<li>And you can <a href="http://tutorials.sqlserverpedia.com/SQLServerPedia-20090604-DataMining.zip">download the StackOverflow users spreadsheet shown in the video</a></li>
</ul>
<p>If you want to avoid the whole SQL Server Analysis Services thing altogether, you can also use Microsoft&#8217;s <a href="http://www.sqlserverdatamining.com/cloud/">free SQL Server Data Mining in the Cloud plugin</a>.  Be aware that it&#8217;s a technical preview, not a fully supported &amp; released product.  Their cloud servers can (and do) go down.  Also know that your data is going into the cloud, which has its own ramifications as I&#8217;ve discussed in <a href="http://sqlserverpedia.com/blog/uncategorized/sql-server-data-mining-in-the-cloud/">my previous cloud data mining tutorial</a>.</p>
<h3>What&#8217;s Coming Next: SQL Server 2008 R2 with BI in Excel</h3>
<p>In the <a href="http://www.microsoft.com/sqlserver/2008/en/us/r2.aspx">next version of SQL Server</a>, Microsoft will deliver business intelligence to end users through Excel.  At the <a href="http://sqlpass.org/">Professional Association for SQL Server</a> Summit last November, Donald Farmer demoed slicing and dicing of huge spreadsheets with real-time analytics that previously would have required some pretty hefty hardware.</p>
<p>Excel 2007 has a million-row limit, but the forthcoming version will not.  Some of the StackOverflow export tables like Votes have more than a million rows, so we can&#8217;t yet data mine those using Excel as a front end, but we can play with the Users table today.</p>
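<p>If you want to check whether a table from the export will fit before you try to open it, here&#8217;s a minimal Python sketch.  It assumes each record in the XML file is an element named row (an assumption about the export format, so change row_tag if yours differs) and it uses Excel 2007&#8217;s worksheet limit of 1,048,576 rows.  The function names are made up for illustration.</p>

```python
import io
import xml.etree.ElementTree as ET

# Excel 2007 worksheets hold at most 1,048,576 rows.
EXCEL_2007_MAX_ROWS = 1_048_576


def count_rows(xml_source, row_tag="row"):
    """Stream through an XML file (path or file object) and count row elements.

    row_tag is an assumption about the export layout, not a documented name.
    """
    count = 0
    for _event, element in ET.iterparse(xml_source, events=("end",)):
        if element.tag == row_tag:
            count += 1
        # Free memory as we go so big files do not have to fit in RAM.
        element.clear()
    return count


def fits_in_excel_2007(row_count, header_rows=1):
    """Return True if the data plus header rows fit on one worksheet."""
    return row_count + header_rows <= EXCEL_2007_MAX_ROWS


# Tiny in-memory stand-in for an export file (made-up content).
sample = io.BytesIO(b"<votes><row Id='1'/><row Id='2'/></votes>")
sample_count = count_rows(sample)
sample_fits = fits_in_excel_2007(sample_count)
```

<p>Point count_rows at a real file path instead of the in-memory sample to test one of the export tables.</p>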
<h3>Subscribing or Downloading My Podcasts</h3>
<p>If you have an MP3 player or a portable video player and you want to download my podcasts automatically, you can subscribe to the SQLServerPedia podcast feeds here:</p>
<ul>
<li><a href="http://feeds.feedburner.com/SqlserverpediaSqlServerTutorialPodcastMP4">MP4 (Apple) Video Feed</a></li>
<li><a href="http://feeds.feedburner.com/SqlserverpediaSqlServerTutorialPodcastWMV">WMV (Microsoft) Video Feed</a></li>
<li><a href="http://feeds.feedburner.com/SqlserverpediaSqlServerTutorialPodcastMP3">MP3 Audio-Only Feed</a></li>
<li><a href="zune://subscribe/?SQLServerPedia%20Video=http://feeds.feedburner.com/SqlserverpediaSqlServerTutorialPodcastMP4">Zune One-Click Subscribe for Video</a></li>
<li><a href="zune://subscribe/?SQLServerPedia%20Audio=http://feeds.feedburner.com/SqlserverpediaSqlServerTutorialPodcastMP3">Zune One-Click Subscribe for MP3</a></li>
</ul>
<p>You can also download this video to watch it later:</p>
<ul>
<li><a href="http://tutorials.sqlserverpedia.com/SQLServerPedia-20090604-DataMining.mp4">MP4 (Apple) Video Download</a></li>
<li><a href="http://tutorials.sqlserverpedia.com/SQLServerPedia-20090604-DataMining.wmv">WMV (Microsoft) Video Download</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2009/06/data-mining-the-stackoverflow-database/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
<enclosure url="http://tutorials.sqlserverpedia.com/SQLServerPedia-20090604-DataMining.mp4" length="56139922" type="video/mp4" />
<enclosure url="http://tutorials.sqlserverpedia.com/SQLServerPedia-20090604-DataMining.wmv" length="24671839" type="video/x-ms-wmv" />
		</item>
		<item>
		<title>Index Fragmentation Findings: Part 2, Size Matters</title>
		<link>http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-2-size-matters/</link>
		<comments>http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-2-size-matters/#comments</comments>
		<pubDate>Tue, 10 Feb 2009 14:15:00 +0000</pubDate>
		<dc:creator>Brent Ozar</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[SQLServerPedia Syndication]]></category>
		<category><![CDATA[datamining]]></category>
		<category><![CDATA[fragmentation]]></category>
		<category><![CDATA[index]]></category>

		<guid isPermaLink="false">http://www.brentozar.com/?p=2458</guid>
		<description><![CDATA[Last week, I blogged about the basics of SQL Server index fragmentation: why it happens, how to fix it, and how often people are fixing it.  I left you with a cliffhanger: it seemed that the frequency of defrag jobs didn&#8217;t appear to affect fragmentation levels: Databases with no index defragmentation were an average of...
]]></description>
			<content:encoded><![CDATA[<p>Last week, I blogged about <a href="http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-1-the-basics/">the basics of SQL Server index fragmentation</a>: why it happens, how to fix it, and how often people are fixing it.  I left you with a cliffhanger: it seemed that the frequency of defrag jobs didn&#8217;t affect fragmentation levels:</p>
<ul>
<li>Databases with no index defragmentation were an average of 5% fragmented</li>
<li>Monthly &#8211; 17% fragmented</li>
<li>Weekly &#8211; 3% fragmented</li>
<li>Daily &#8211; 6% fragmented</li>
</ul>
<p>At first glance, that would seem to indicate that your database ended up worse off if you defragmented! But like all good novels (and most bad ones), the plot thickens.</p>
<h3>Enter Data Mining with Excel and SQL Server</h3>
<div id="attachment_2461" class="wp-caption alignright" style="width: 310px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; float: right;"><a href="http://www.flickr.com/photos/niosh/2492849496/"><img class="size-medium wp-image-2461" title="data-mining" src="http://d2me0cejidzvf9.cloudfront.net/wp-content/uploads/2009/02/data-mining-300x199.jpg" alt="Data Mining with Open Source Tools" width="300" height="199" /></a><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">Your Grandfather In His Cubicle</p></div>
<p>Data mining is a lot like diamond mining, only there&#8217;s no <a href="http://www.theatlantic.com/doc/198202/diamond">monopoly on the market</a>, and the ladies don&#8217;t seem to appreciate a quality KPI.  Otherwise, they&#8217;re identical: there&#8217;s a whole lot of money in it, but that money doesn&#8217;t usually go to the people who do the actual mining.  It goes to the executives and salespeople who take advantage of the mined products to make better decisions.</p>
<p>The people doing the mining, on the other hand, are forced to spend their lives in tiny, dark caves (or &#8220;cubicles&#8221;) trying to extract beautiful gems (or &#8220;data&#8221;) while risking painful lung ailments (or &#8220;carpal tunnel&#8221;) due to toiling with terribly unsafe and outdated hardware (or &#8220;hardware&#8221;).</p>
<p>For today&#8217;s demo, I will be the miner, and you&#8217;ll be the executive who takes advantage of my work. (It&#8217;s okay, I&#8217;m used to it &#8211; I work for a vendor now.)</p>
<p>In my podcast <a href="http://sqlserverpedia.com/wiki/Data_Mining_with_Excel">Data Mining with Excel in Four Minutes</a>, I explained how to set up Microsoft&#8217;s free data mining add-ins for Excel 2007.  It&#8217;s an Excel plugin that hooks up to any SQL Server Analysis Services server on your network, either SQL Server 2005 or 2008, and makes data mining a point-and-click affair.  It doesn&#8217;t require high-end horsepower &#8211; even a desktop or laptop works great for this.  If you can&#8217;t be bothered to set up an SSAS instance, then check out my <a href="http://sqlserverpedia.com/blog/analysis-services/sql-server-data-mining-in-the-cloud/">Data Mining in the Cloud</a> writeup on how to get started without using a server at all.</p>
<h3>Help SSAS Help You: Explain Your Numbers</h3>
<p>While data mining is really easy to set up, you can get much better results if you &#8220;prequalify&#8221; your data and turn some of the numbers into basic categories.</p>
<p>If I were working with United States salary data, for example, my source data might have a column for Hourly Wage.  I would add another column and call it Tipped Employees:</p>
<ul>
<li>Under $6.55 per hour &#8211; Tipped Employees = &#8220;Yes.&#8221;  You can pay someone less than minimum wage if they get tips, and in that case, you really just can&#8217;t go by their hourly wage alone.</li>
<li>$6.55 per hour and over &#8211; Tipped Employees = &#8220;Unknown.&#8221;  In a perfect world, I&#8217;d have enough data to find out if these people get tips, but that&#8217;s not always the case.</li>
</ul>
<p>By adding a new attribute to my data, something that&#8217;s not clear from the numbers alone, I might get better insight from my data mining efforts.</p>
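<p>Here&#8217;s a rough Python sketch of that kind of prequalifying.  The $6.55 threshold comes from the example above; the function name, the column names, and the sample wages are all hypothetical.</p>

```python
# Threshold from the example above: the US minimum wage of $6.55 per hour.
MINIMUM_WAGE = 6.55


def tipped_label(hourly_wage):
    """Label a wage "Yes" when it is under the minimum, else "Unknown"."""
    return "Yes" if hourly_wage < MINIMUM_WAGE else "Unknown"


# Made-up sample rows; a real data set would come from your own source.
rows = [
    {"hourly_wage": 2.13},
    {"hourly_wage": 6.55},
    {"hourly_wage": 18.00},
]

# Add the new attribute alongside the raw number instead of replacing it.
for row in rows:
    row["tipped_employees"] = tipped_label(row["hourly_wage"])
```

<p>The raw wage stays in the data, so the mining tool can use both the number and the label.</p>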
<p>By the way, if you&#8217;re reading this and it&#8217;s after July 2009, the minimum wage has risen to $7.25 per hour.  If you&#8217;re a VB developer, you should immediately ask for a pay increase to match the new standard &#8211; unless of course they&#8217;ve got a tip jar by your desk.</p>
<h3>Explaining Our Index Fragmentation Numbers</h3>
<p>In the case of our index fragmentation numbers, one of the source data fields is Page Count &#8211; the number of pages that an object has.  Size matters with fragmentation: small objects with only a handful of pages may appear to have very high fragmentation numbers, but they can&#8217;t actually be defragmented.  There&#8217;s only so much defragmentation you can do when a table only has three pages.  I&#8217;ve actually been on support escalation calls where customers demand to know why a defrag job doesn&#8217;t reduce all types of fragmentation to absolute zero, even for tables with just one page.</p>
<p><a href="http://technet.microsoft.com/en-us/library/cc966523.aspx">Microsoft&#8217;s best practices on SQL Server 2000 index defragmentation</a> note that:</p>
<p style="padding-left: 30px;"><em>&#8220;Generally, you should not be concerned with fragmentation levels of indexes with less than 1,000 pages. In the tests, indexes containing more than 10,000 pages realized performance gains, with the biggest gains on indexes with significantly more pages (greater than 50,000 pages).&#8221;</em></p>
<p>With that in mind, I added a Page Count Group column and calculated it with a formula:</p>
<pre>=IF(Table1[[#This Row],[page_count]]&gt;50000,"Large",(IF(Table1[[#This Row],[page_count]]&lt;10000,"Small","Medium")))</pre>
<p>That adds a text label for Small, Medium or Large depending on the size of the table.</p>
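<p>If you&#8217;d rather do the grouping outside of Excel, the same logic looks roughly like this in Python.  The thresholds are copied from the formula above; the function name is my own invention for this sketch.</p>

```python
def page_count_group(page_count):
    """Mirror the Excel formula: over 50,000 is Large, under 10,000 is Small."""
    if page_count > 50000:
        return "Large"
    if page_count < 10000:
        return "Small"
    # Everything from 10,000 through 50,000 inclusive lands here.
    return "Medium"


# Boundary values behave the same way as the spreadsheet formula.
labels = [page_count_group(n) for n in (3, 9999, 10000, 50000, 50001)]
```

<p>Note that exactly 10,000 and exactly 50,000 pages both come out as Medium, just like in the spreadsheet.</p>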
<h3>Suddenly, The Data Makes More Sense</h3>
<div id="attachment_2472" class="wp-caption alignleft" style="width: 357px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; float: left;"><img class="size-full wp-image-2472" title="sql-server-fragmentation-pivot-table" src="http://d2me0cejidzvf9.cloudfront.net/wp-content/uploads/2009/02/sql-server-fragmentation-pivot-table.png" alt="Fragmentation Pivot Table" width="347" height="145" /><p style=' padding: 0 4px 5px; margin: 0;'  class="wp-caption-text">Fragmentation Pivot Table</p></div>
<p>Even before doing data mining, if we just add a Pivot Table, we can suddenly make more sense out of the numbers.</p>
<p>For Large tables, we see an average 44% fragmentation when the database has no defragmentation jobs set up.  Monthly defrag drops that to 14%, and daily drops it to just 2%!  The Weekly data is a bit of an outlier here, but it&#8217;s still less than no defrag jobs at all, so we&#8217;ll have to dig deeper.</p>
<p>For Medium tables, we see the type of data distribution we would hope for: the more often we defrag, the lower our fragmentation gets.</p>
<p>For Small tables, the data is all over the place, but we know why: it has to do with the way smaller tables behave.</p>
<p>Adding this bit of human interpretation helped us get better results from our data &#8211; and we haven&#8217;t even started mining!</p>
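<p>The pivot table in this section boils down to averaging fragmentation for each combination of size group and defrag frequency.  Here&#8217;s a minimal Python sketch of that calculation.  The function name is hypothetical and the sample rows are made-up numbers for illustration, not the survey data.</p>

```python
from collections import defaultdict


def pivot_average(rows):
    """Average fragmentation for each (size group, defrag frequency) pair.

    Each row is a tuple: (size_group, defrag_frequency, fragmentation_percent).
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for size_group, defrag_frequency, fragmentation in rows:
        cell = totals[(size_group, defrag_frequency)]
        cell[0] += fragmentation
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up sample rows, NOT the real survey results.
sample_rows = [
    ("Large", "None", 30.0),
    ("Large", "None", 50.0),
    ("Large", "Daily", 2.0),
    ("Small", "Daily", 90.0),
]

averages = pivot_average(sample_rows)
```

<p>Grouping on both columns is what keeps the noisy Small tables from hiding the pattern in the Large ones.</p>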
<h3>More Reading on SQL Server Fragmentation</h3>
<p>If you liked this article, check out:</p>
<ul>
<li><a href="http://www.quest.com/events/listdetails.aspx?contentid=9230&amp;technology=34&amp;prod=&amp;prodfamily=&amp;loc=">Webcast on SQL Server index fragmentation with Michelle Ufford</a></li>
<li><a href="http://sqlfool.com/tag/defrag/">Michelle Ufford&#8217;s latest articles on fragmentation</a></li>
<li><a href="http://sqlserverpedia.com/wiki/Index_Maintenance">SQLServerPedia&#8217;s free index defrag scripts</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.brentozar.com/archive/2009/02/index-fragmentation-findings-part-2-size-matters/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

