Quae Nocent Docent

What hurts, teaches – Ordinary tales from management trenches

Archive for March 2009

KB 958490 – KB article updated

without comments

Now it correctly reports MOMAgentInstaller.exe as one of the files modified by the fix.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

March 30, 2009 at 10:52 am

Posted in KB, SCOM

SCOM Patching blues – part 2

without comments

In my previous post on this topic I pointed out two issues with KB 956689 and KB 958490. For the former I stated that several files are updated by the fix even though they’re not cited in the KB article, and without any change to their version number. Daniele Muscetta (msft) correctly pointed me to this msdn article, which states that by default:

1) versioned files are changed just when a new version is present in the msp file

2) unversioned files are changed based on checksum
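The two default rules can be sketched in a few lines of Python. This is only a simplified model of the two statements quoted above, not the Windows Installer implementation: `FileInfo`, `parse_version` and `replaced_by_default` are hypothetical names, and the mixed case (one file versioned, the other not) is deliberately left out because the two rules do not cover it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FileInfo:
    version: Optional[Tuple[int, ...]]  # None means the file carries no version
    checksum: str                       # any content hash; the algorithm is not modelled


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '6.0.6278.0' into (6, 0, 6278, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def replaced_by_default(installed: FileInfo, patched: FileInfo) -> bool:
    """Apply only the two default rules quoted above (a simplification)."""
    if installed.version is not None and patched.version is not None:
        # Rule 1: a versioned file is replaced only by a newer version.
        return patched.version > installed.version
    if installed.version is None and patched.version is None:
        # Rule 2: an unversioned file is replaced when the checksum differs.
        return patched.checksum != installed.checksum
    raise NotImplementedError("mixed versioned/unversioned case is outside the two rules")
```

Under these assumptions a patched binary with the same version number as the installed one is not replaced by default, whatever its content.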

Now, I gave Daniele a quick answer in the comment section, but I understand I need to elaborate on my statement. On one hand, the keyword here is “by default”, which means when no special parameters are passed to msiexec. On the other, the article explains other cases in which a versioned file can be replaced with a different file carrying the same version. So, once the file is in the msp, the exact behavior depends on the way msiexec is called. In keeping with Murphy’s law, here’s a screenshot from a system where I applied the fix:

clip_image001

Taking a closer look at MOMModules.dll we can see it’s still at version 6.0.6278.0, but it’s not the SP1 RTM binary and its change date is the same as that of MOMNetworkModules.dll version .41 (the dll changed by our very own fix). So the fix changes more files than documented. There can be valid reasons to include more files in an msp than the one meant to be patched, for example dependency relationships with specific modules that the dev wants to be sure are maintained, but, imho, there are no valid reasons to have different builds of the same file with the same version number. And this is exactly what happens in this fix. Anyway, assuming the only file that needs to be updated by the fix is MOMNetworkModules, the fix is now obsolete, so just do not apply it.
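One way to spot this kind of problem is to compare content hashes of files that claim the same version. The following is a minimal sketch, assuming the file name, version string and content digest have already been collected somehow; `digest` and `same_version_different_content` are hypothetical helpers and the sample bytes in the usage example are made up.

```python
import hashlib
from collections import defaultdict


def digest(content: bytes) -> str:
    """Content hash used to tell two binaries apart."""
    return hashlib.sha256(content).hexdigest()


def same_version_different_content(records):
    """records: iterable of (file_name, version, content_digest) tuples.

    Returns the (file_name, version) pairs seen with more than one digest,
    i.e. different binaries hiding behind the same version number.
    """
    digests = defaultdict(set)
    for file_name, version, content_digest in records:
        digests[(file_name.lower(), version)].add(content_digest)
    return sorted(key for key, found in digests.items() if len(found) > 1)


# Made-up inventory: two different builds of one file, both reporting 6.0.6278.0.
inventory = [
    ("MOMModules.dll", "6.0.6278.0", digest(b"rtm build")),
    ("MOMModules.dll", "6.0.6278.0", digest(b"build shipped in the fix")),
]
suspects = same_version_different_content(inventory)
```

Here `suspects` contains the single pair for MOMModules.dll, which is the situation described above.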

As for KB 958490, which caused a huge increase in agent-side CPU utilization in my environment, I want to add some information:

  • this behavior is probably not a rule; it depends on the mix of MPs implemented
  • at one of our customers we tried the fix and immediately saw an increase on DCs but apparently not on other servers; we were in production and kept the fix for just a few hours before uninstalling it
  • I know the MOM team is working hard on a repro, without any luck as of today
  • So if you have the dependency rollup issue solved by the fix (and you have it, trust me :-)) you can probably give it a try and check the HealthService CPU utilization on your agents. If you have any experience to share, I’m interested in hearing it.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

March 26, 2009 at 1:21 am

Posted in SCOM

Troubleshooting Data warehouse arithmetic overflows

without comments

In the last weeks, in the wake of the R2 RC release, I decided it was time to review all the MPs we developed and prepare an action plan for improvement. I have great expectations for R2.

So I started to revamp our development lab, adding missing MPs and updating obsolete ones. Soon I started to get error 31553 and my state reports started to get messed up.

Error 31553 is pretty generic; mine referred to an arithmetic overflow:

Event Type:            Error
Event Source:          Health Service Modules
Event Category:        Data Warehouse 
Event ID:              31553
Date:                  3/6/2009
Time:                  6:51:41 PM
User:                  N/A
Computer:              SCOM-MS1
Description:
 
Data was written to the Data Warehouse staging area but processing failed on one of the subsequent operations.
Exception 'SqlException': Sql execution failed. Error 8115, Level 16, State 2, Procedure ManagedEntityChange, Line 157, Message: Arithmetic overflow error converting expression to data type int. 
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.Synchronization.ManagedEntity

I knew the DW has a staging process: new data is staged in small, fast tables and then inserted into the proper ones. This way the insertion completes quickly and a background process takes care of the proper consolidation. In this case the table is dbo.ManagedEntityStage. Opening the table with SQL Server Management Studio returned too many rows for a transit table in a very limited lab environment, but the data seemed ok. The data was queuing up, that’s for sure, so something must have gone wrong in the process. I drilled further down into the staging process. The main stored procedure here is dbo.StateProcessStaging, so I started to step through the SQL code to check where an arithmetic overflow could occur. While doing this I typed this simple query: select * from dbo.ManagedEntityStage. Tada: a certain number of rows were returned, but then the select stopped with an arithmetic overflow (I forgot one of my golden rules: don’t trust GUIs).

So the issue is with one or more rows in ManagedEntityStage. Time to review the table schema:

image

As it turned out, the table has a computed column (InsertReadyInd); the formula is as follows:

image

So here they’re multiplying the ManagedEntityTypeRowId (i.e. the unique identifier inside the DW for object classes) by the host id of the managed entity. Both are integers. SQL has an implicit type conversion for computed columns: it takes the least restrictive type of the terms involved in the formula as the column type (http://msdn.microsoft.com/en-us/library/ms163363.aspx). In our case, as far as SQL is concerned, InsertReadyInd is an integer. But int * int cast to an int can lead to an arithmetic overflow, and this was exactly my case. You must consider that in our dev lab we constantly add and remove MPs. Every time an MP is added the ManagedEntityTypeRowId increments, even if that MP has been added in the past (this wouldn’t be the case if we just updated the MPs). So this is probably not a common scenario; by the way, the same issue applies to other staging tables. Nevertheless it cannot be excluded that this same issue could hit a huge production management group.

Obviously, since this is a lab environment, I didn’t open a PSS incident; I simply fixed the computed column myself. ** do not do this in production unless directed by PSS **

So I changed the formula for InsertReadyInd to:

(abs(isnull([ManagedEntityTypeRowId],(0))*CAST(isnull([TopLevelHostManagedEntityRowId],(0)) AS FLOAT)))

This did the magic and now my stale state data is flowing into the DW.
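To make the arithmetic concrete, here is a small Python model of the two versions of the formula. Python integers never overflow, so the 32-bit range of a SQL int is checked by hand; the function names and the row id values in the usage lines are illustrative assumptions, not values taken from the lab database.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # range of a SQL Server int


def multiply_as_int(type_row_id: int, host_row_id: int) -> int:
    """Model of the original formula: the int * int product must still fit in an int."""
    product = type_row_id * host_row_id
    if not INT_MIN <= product <= INT_MAX:
        raise OverflowError("Arithmetic overflow error converting expression to data type int")
    return abs(product)


def multiply_as_float(type_row_id: int, host_row_id: int) -> float:
    """Model of the modified formula: one operand is cast to float before multiplying."""
    return abs(type_row_id * float(host_row_id))


# Illustrative row ids: each fits in an int, their product (2.5 billion) does not.
wide_result = multiply_as_float(50_000, 50_000)
```

`multiply_as_int(50_000, 50_000)` raises, while the float version returns 2500000000.0. A float keeps integers exact only up to 2**53, which is far beyond the product of two positive ints of this size in practice but is worth keeping in mind.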

- Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

March 23, 2009 at 6:13 pm

Posted in Bug, Data Warehouse, SCOM

The downside of on demand detections

with 3 comments

Very few management packs implement on demand detections for monitors. On demand detections are the mechanism through which the “Recalculate Health” or “Reset Health” actions work.

image

Alas, the two cannot be implemented together: I can have a single on demand detection (and in this case I can use Reset Health) *or* multiple on demand detections (one for every possible monitor state, so two or three), and in this case I can have “Recalculate Health”. Moreover, cookdown doesn’t work for on demand detections. So for every single monitor target all the on demand detections are run. Bad, very bad. How bad?

Take this example: you need to monitor 100 SQL databases on a single SQL instance. You want a three state monitor for, let’s say, the lock count on every single DB. It is possible to write a monitor targeted at each DB (100) whose data source runs just once per SQL instance. In a single run it returns the data items for all the databases, and then, via filtering, the state of every single database is evaluated. This can be achieved thanks to the cookdown logic built into the HealthService process. But if you want to have on demand detections for the same monitor, you’ll find your data provider (let’s say a script) is run number of DBs (100) * number of on demand detections (3) times, i.e. 300 cscript.exe processes run at once.
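The difference in script launches can be written down as a tiny calculation. This is just the arithmetic of the example above; the function names are hypothetical and nothing here talks to the HealthService.

```python
def script_runs_with_cookdown(sql_instances: int) -> int:
    """With cookdown the data source runs once per SQL instance, however many DBs it hosts."""
    return sql_instances


def script_runs_on_demand(monitored_databases: int, on_demand_detections: int) -> int:
    """Without cookdown every detection of every monitored database starts its own run."""
    return monitored_databases * on_demand_detections


regular = script_runs_with_cookdown(sql_instances=1)
on_demand = script_runs_on_demand(monitored_databases=100, on_demand_detections=3)
```

For the example, `regular` is 1 and `on_demand` is 300.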

This is bad design, and it makes on demand detections very dangerous. You must consider that on demand detections are run at every monitor initialization (when the MP is deployed, when the monitor is modified by an override, when the agent exits maintenance mode, and so on), not just when the user hits the recalculate button. I hope it will be fixed in R2 (stay tuned, it will be one of the first things I’ll try once the RC is out).

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

March 12, 2009 at 7:09 pm

Posted in MP, SCOM

Patching blues – QFE 958490 pumps up HealthService CPU Usage on agents

with one comment

Quick warning to everyone reading. Today I tried KB 958490, which promised to solve a long-standing, annoying issue with dependency rollup monitors. Alas the cure, at least in my environment, proved to be worse than the disease. In fact, as soon as I tried it in my lab and my limited production environment, the HealthService on every agent jumped up in CPU utilization. This increase is independent of the patch level of the agent (i.e. it can be patched or not, the net result doesn’t change) and is correlated with the patch level of the RMS. So as soon as you update the RMS and the MS you get the issue. The following graph of HealthService CPU utilization is self-explanatory:

image

My advice: stay away from the fix until it is re-released (as usual, sigh).

If anyone’s having a different experience it’d be great to hear about it.

btw the fix doesn’t update only the dlls reported by the KB, but MOMAgentInstaller.exe as well.

image

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

March 12, 2009 at 6:47 pm

Posted in Bug, SCOM

SCOM patching blues

with 10 comments

This one is an old favorite of mine. I tried to explain to the team (even with flames on Connect) the importance of proper patching, alas with no luck. They had a serious issue where Windows Installer returned success while, in reality, patching didn’t take place. A lot of QFEs have been silently re-released to address the issue. Moreover, the patchlist field in the Health Service class (thanks to QFE 958253) now reports only the KB numbers, so it’s now possible to track the QFEs applied on each HealthService.

Alas, the blues is far from over: due to the sequencing error in old QFEs, I discovered I had plenty of agents not up to date, even though the QFE is “apparently” installed. The only way to check for this issue is to get the file version of every single affected file on every single agent. To address this I developed a quick MP that collects this data for me and adds a monitor for the required version of each module. If anyone is interested in this one let me know and I will check if I can share it.

image
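The check itself boils down to comparing each collected file version with the minimum required one. The sketch below is not the MP mentioned above, just a Python illustration of the comparison; it assumes the versions have already been gathered per agent, and the agent names, file names and version numbers in the usage example are made up.

```python
def parse_version(text: str):
    """Turn '1.0.0.41' into (1, 0, 0, 41) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def outdated_files(required, collected):
    """required: {file_name: minimum_version}; collected: {agent: {file_name: version}}.

    Returns (agent, file_name, found_version, required_version) for every file
    that is missing from the collected data or older than required.
    """
    findings = []
    for agent, files in sorted(collected.items()):
        for file_name, minimum in sorted(required.items()):
            found = files.get(file_name)
            if found is None or parse_version(found) < parse_version(minimum):
                findings.append((agent, file_name, found, minimum))
    return findings


# Made-up data: AGENT02 reports the QFE as installed but still has the old binary.
required = {"SomeModule.dll": "1.0.0.41"}
collected = {
    "AGENT01": {"SomeModule.dll": "1.0.0.41"},
    "AGENT02": {"SomeModule.dll": "1.0.0.9"},
}
report = outdated_files(required, collected)
```

Parsing the versions into tuples matters: compared as plain strings, "1.0.0.9" would sort after "1.0.0.41" and the stale agent would go unnoticed.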

KB articles do not report the superseding of old QFEs by new ones, and KB 956689 (recently re-released) is still a mystery to me: it states that just MOMNetworkModules.dll is affected, but if you open the patch file with Orca you’ll find several files get modified without even updating the version number.

image

So, after trying for the last 4 months to address the issue with PSS and the product team, I finally decided to do my own checks and to build my own list of must-have hotfixes (here I’m referring just to agent QFEs, not to RMS/MS or MP specific QFEs). Since this took a lot of work, I’d like to share this list with the community:

  • KB 954049
  • KB 954903
  • KB 956172
  • KB 956689 (still doubtful, since the MOMNetworkModules.dll has been superseded by 957511)
  • KB 957511

This means that the AgentManagement directory on every MS should look like this:

image

Once again hope this will help.

Update: I’m authorized to share a stripped down version of the MP I’ve been referring to: Progel.HSVersion.Public.xml.

– Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

March 11, 2009 at 6:32 pm

Posted in Bug, SCOM

Using DPM to backup a MOSS farm with SQL fixed port

with one comment

Last week a colleague of mine called me in for an issue with DPM. He was in charge of protecting a MOSS farm with DPM, as he had done several times before, but on this occasion the content databases weren’t backed up, and no explicit error was reported.

Since this post is going to be long, I’ll start with the conclusions. The issue was related to MOSS being configured to use a SQL named instance with a fixed port. The issue is actually caused by the fixed port, but if you have both (named instance and fixed port) you won’t get any error, and this makes things confusing. So if you’re forced to use a SQL Server fixed port and you cannot use the SQL endpoint mapper on port 1434 (so you hard coded the port in your MOSS configuration), and you want to use DPM to back up a MOSS farm, you should use a SQL alias to map server\instance to server\instance,port, and you must use server\instance (without specifying any port) in your MOSS configuration.

If you’re interested in how I got to this conclusion you can read the rest of this post.  :-)

I started from the basics: I asked for the MOSS configuration via stsadm and then checked the WssCmdletsWrapperCurr log file. DPM uses a WSS wrapper to query the MOSS farm for configuration parameters, and this log traces the results of these queries.

But first of all I wanted to turn on all the debug tracing for DPM. This task is partially documented in Diagnostic Process for Request Tracking and Error Tracing, which fails to list the various debugging levels. Since this is standard ETL tracing, it turned out the common constants can be used:

clip_image002

I turned on everything (0xFF) and started digging inside this issue.

The output from stsadm -o enumcontentdbs -url https://myfarm.mydom.it was like this:

<Databases Count="1">
  <ContentDatabase Server="MYSQLDB\MOSS,2319" Name="COntentDB_WA_Portal_1" />
</Databases>

So I have just one content DB, on SQL Server MYSQLDB, instance MOSS, listening on port 2319.

Then I dug into WssCmdletsWrapperCurr, and there I found that DPM was marking the content db as not to be backed up (BackupRequired=False):

WSSCmdlets.cs(292)            ==>IsDatabaseToBeBackedUp
WSSCmdlets.cs(257)            ==>GetConfigurationDatabaseName
WSSCmdlets.cs(1104)            Sql Instance Name = MYSQLDB\MOSS
WSSCmdlets.cs(257)            <--GetConfigurationDatabaseName
WSSCmdlets.cs(1104)            Sql Instance Name = MYSQLDB\MOSS,2319
WSSCmdlets.cs(1104)            Sql Instance Name = MYSQLDB\MOSS,2319
WSSCmdlets.cs(1104)            Sql Instance Name = MYSQLDB\MOSS,2319
WSSCmdlets.cs(1104)            Sql Instance Name = MYSQLDB\MOSS,2319
WSSCmdlets.cs(332)            BackupRequired for Component MYSQLDB\MOSS\ContentDB_WA_Portal_1 = False
WSSCmdlets.cs(292)            <--IsDatabaseToBeBackedUp

So DPM seemed to enumerate the MOSS content DB correctly, but somehow marked it as not to be backed up. Now it was time to understand when DPM considers a content db as needing backup. I didn’t find any documentation on the topic, but the log was clear enough and told me this is managed code (.cs), so I had a good chance of getting to the source code using Reflector without a lengthy debugging session. Thanks (and thanks again) to the great Reflector, I got to the source code of the WSS wrapper (wsscmdlets.dll).

   1: public void IsDatabaseToBeBackedUp(string instanceName, string databaseName, out bool backupRequired)
   2: {
   3:     IDisposable disposable = Tracer._TraceFunction("WSSCmdlets.cs", 0x124, "IsDatabaseToBeBackedUp");
   4:     try
   5:     {
   6:         backupRequired = false;
   7:         if (!backupRequired)
   8:         {
   9:             string str = string.Empty;
  10:             string str2 = string.Empty;
  11:             this.GetConfigurationDatabaseName(out str, out str2);
  12:             if (string.Equals(instanceName, str, StringComparison.InvariantCultureIgnoreCase) && string.Equals(databaseName, str2, StringComparison.InvariantCultureIgnoreCase))
  13:             {
  14:                 Tracer._TraceMessage(0x10, "WSSCmdlets.cs", 0x138, @"Found Config Database {0}\{1}", new object[] { instanceName, databaseName });
  15:                 backupRequired = true;
  16:             }
  17:         }
  18:         if (!backupRequired)
  19:         {
  20:             SPWebService spWebService = SPWebService.get_AdministrationService();
  21:             backupRequired = this.IsContentDatabase(spWebService, instanceName, databaseName);
  22:         }
  23:         if (!backupRequired)
  24:         {
  25:             SPWebService service2 = SPWebService.get_ContentService();
  26:             backupRequired = this.IsContentDatabase(service2, instanceName, databaseName);
  27:         }
  28:         Tracer._TraceMessage(0x10, "WSSCmdlets.cs", 0x14c, @"BackupRequired for Component {0}\{1} = {2}", new object[] { instanceName, databaseName, (bool) backupRequired });
  29:     }
  30:     catch (Exception exception)
  31:     {
  32:         Tracer._TraceMessage(2, "WSSCmdlets.cs", 0x151, @"Caught Exception while checking if the database [{0}\{1}] requires Backup.", new object[] { instanceName, databaseName });
  33:         LogExceptionDetails(exception);
  34:         throw;
  35:     }
  36:     finally
  37:     {
  38:         if (disposable != null)
  39:         {
  40:             disposable.Dispose();
  41:         }
  42:     }
  43: }

And here we go: on line 11 we find the call to GetConfigurationDatabaseName, which returns the SQL instance name in str and the configuration database name in str2. OK, but this is not a config db, so we must concentrate on line 21 (and 26). IsContentDatabase takes instanceName and databaseName as input parameters; from the trace file (line WSSCmdlets.cs(332)) we can assume instanceName=MYSQLDB\MOSS and databaseName=ContentDB_WA_Portal_1. Let’s move on, but now the log is clearer: in one case we have MYSQLDB\MOSS,2319 and in the other MYSQLDB\MOSS.

Digging further, this is the snippet for IsContentDatabase:

   1: private bool IsContentDatabase(SPWebService spWebService, string instanceName, string databaseName)
   2: {
   3:     foreach (SPWebApplication application in spWebService.get_WebApplications())
   4:     {
   5:         foreach (SPContentDatabase database in application.get_ContentDatabases())
   6:         {
   7:             string formattedSqlInstanceName = GetFormattedSqlInstanceName(database.get_Server());
   8:             string a = database.get_Name();
   9:             if (string.Equals(formattedSqlInstanceName, instanceName, StringComparison.InvariantCultureIgnoreCase) && string.Equals(a, databaseName, StringComparison.InvariantCultureIgnoreCase))
  10:             {
  11:                 Tracer._TraceMessage(0x10, "WSSCmdlets.cs", 0x4d0, @"Found Content Database {0}\{1}", new object[] { instanceName, databaseName });
  12:                 return true;
  13:             }
  14:         }
  15:     }
  16:     return false;
  17: }

Here we can see it’s using the MOSS WebService, so we must expect the same output we got from stsadm (and indeed we do). On line 9 we find the culprit of our issue: it is comparing MYSQLDB\MOSS,2319 with MYSQLDB\MOSS, and they will never match. Hence for DPM this is not a content db and doesn’t need to be backed up. This is an evident bug in the DPM WSS wrapper, which can be worked around as proposed at the beginning of this post.
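The failing comparison is easy to reproduce outside DPM. The Python sketch below mirrors the logic of IsContentDatabase with the values from the log; it is not DPM code. `strip_port` is a hypothetical normalization shown only to illustrate why the names stop matching (it is not something the wrapper does), and `casefold()` is used as an approximation of the case-insensitive comparison in the C# source.

```python
def same_name(left: str, right: str) -> bool:
    """Case-insensitive comparison, approximating InvariantCultureIgnoreCase."""
    return left.casefold() == right.casefold()


def strip_port(server: str) -> str:
    """Hypothetical normalization: drop a trailing ',port' from 'server\\instance,port'."""
    return server.split(",", 1)[0].strip()


def is_content_database(content_dbs, instance_name, database_name, normalize=None):
    """content_dbs: (server, name) pairs as the farm reports them."""
    for server, name in content_dbs:
        if normalize is not None:
            server = normalize(server)
        if same_name(server, instance_name) and same_name(name, database_name):
            return True
    return False


farm = [(r"MYSQLDB\MOSS,2319", "COntentDB_WA_Portal_1")]

# What the log shows: the port makes the instance names differ, so no match.
as_shipped = is_content_database(farm, r"MYSQLDB\MOSS", "ContentDB_WA_Portal_1")

# With the port removed the same database is recognized.
normalized = is_content_database(farm, r"MYSQLDB\MOSS", "ContentDB_WA_Portal_1", normalize=strip_port)
```

`as_shipped` is False and `normalized` is True. The SQL alias workaround achieves the same effect from the outside: MOSS then reports the server without the port, so both sides of the comparison agree.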

I hope this saves some time for anyone who runs into the same situation.

- Daniele

This posting is provided "AS IS" with no warranties, and confers no rights.

Written by Daniele Grandini

March 5, 2009 at 8:18 pm

Posted in Bug, DPM