The case of the MAXSCHEDSessions valueThursday, November 27. 2008![]() In TSM you can limit the maximum number of simultaneous client sessions that can connect with the server with the 'SETOPT MAXSessions' option. Some server options can be updated dynamically (without stopping and restarting the server) by using the SETOPT command, and this is one of them. The MAXSESSIONS option is changed accordingly in dsmserv.opt for you. You specify the NUMBER of sessions, for example: 'SETOPT MAXSessions 1000'. In addition to the maximum number of sessions allowed, you can specify the number of sessions that are allowed to the server for processing scheduled operations as well. This is done through the 'Set MAXSCHedsessions' command. By limiting the number of scheduled sessions you can give yourself the insurance that, during peak hours, there are always some sessions left for administrative access, unscheduled restores, unscheduled archives, unscheduled backups, etc. You specify the PERCENTAGE of maximum scheduled sessions in relation to number of maximum sessions that is set previously, for example: 'Set MAXSCHedsessions 95'. In this case 95% of 1000 sessions is 950. TSM will calculate the value for the MAXSCHEDSessions on the fly, round it up, and change the value dynamically. A quick way to prevent any client schedules from starting at all is to set the MAXSCHedsessions parameter to a value of 0%. It is stated in the manual that 'The MAXSESSIONS parameter in the server options file determines the maximum number of total available server sessions'. Now, take a look the partial output of a 'q status' below: tsm: HANNIBAL>q st ....cut.... Maximum Sessions: 1,000 Maximum Scheduled Sessions: 1,500 ....cut.... As you're effectively prevented to set the MAXSession parameter to anything more than 100%.... tsm: HANNIBAL>set maxschedsessions 101 ANR2640E SET MAXSCHEDSESSIONS: Invalid percentage of sessions for scheduled processing - 101. ANS8001I Return code 3. ....how is it possible that the number of maximum scheduled sessions is higher then the maximum sessions available? The explanation is so simple that it is something you might not take into consideration at all. The MAXSCHedsessions value is calculated during the 'set MAXSCHedsessions' command. At that moment the maximum sessions are used to calculate the relative value. If, AFTER this calculation, the maximum session value is changed, then the maximum scheduled sessions don't change accordingly automatically. You'll need to update it manually to reflect the correct relative value again. In the example above, at some point in time, there were 2000 sessions and the MAXSCHEDsessions was set to 75% (=1500). After that, someone changed the maximum sessions value to 1000 and did not issue the 'SETOPT MAXSCHEDSessions' command to reflect the new value. The simple solution would be that TSM re-calculate the MAXSCHedsessions automatically, whenever the MAXSESSIONS parameter changes. Running expiration and 'delete filespace' simultaneously (ANR0871E)Friday, November 7. 2008During expiration the ANR0871E (Cannot find file space name for node XX, file space XX) message is encountered. In the help text output of 'h 0871' is stated that if you can't isolate the problem in your activity log, you should contact your service representative. As a possible cause a database error is mentioned. This sounds somewhat too heavy for a very simple problem. Below is an extract from the activity log during expiration: ....cut.... 10/28/2008 11:20:20 ANR4391I Expiration processing node FACE_TSMCL, filespace /u19, fsId 32, domain PDM_004_AIX, and management class MANCLASS_004_A - for ARCHIVE type files. (SESSION: 34008, PROCESS: 819) 10/28/2008 11:20:20 ANR4391I Expiration processing node FACE_TSMCL, filespace /u20, fsId 33, domain PDM_004_AIX, and management class MANCLASS_004_A - for ARCHIVE type files. (SESSION: 34008, PROCESS: 819) 10/28/2008 11:20:20 ANR4391I Expiration processing node FACE_TSMCL, filespace /u21, fsId 34, domain PDM_004_AIX, and management class MANCLASS_004_A - for ARCHIVE type files. (SESSION: 34008, PROCESS: 819) 10/28/2008 11:20:20 ANR4391I Expiration processing node FACE_TSMCL, filespace /u22, fsId 35, domain PDM_004_AIX, and management class MANCLASS_004_A - for ARCHIVE type files. (SESSION: 34008, PROCESS: 819) 10/28/2008 11:20:20 ANR4391I Expiration processing node FACE_TSMCL, filespace /u17, fsId 36, domain PDM_004_AIX, and management class DEFAULT - for BACKUP type files. ....cut.... 10/28/2008 11:20:26 ANR0871E Cannot find file space name for node 301, file space 20. (SESSION: 34008, PROCESS: 819) 10/28/2008 11:20:26 ANR4391I Expiration processing node BLACKVAN_TSMCL, filespace /u02, fsId 21, domain PDM_004_AIX, and management class MANCLASS_004_C - for ARCHIVE type files. (SESSION: 34008, PROCESS: 819) 10/28/2008 11:20:26 ANR4391I Expiration processing node BLACKVAN_TSMCL, filespace /u03, fsId 22, domain PDM_004_AIX, and management class MANCLASS_004_C - for ARCHIVE type files. (SESSION: 34008, PROCESS: 819) At 11:20:20 the expiration process is busy processing node FACE_TSMCL. At 11:20:26 expiration tries to process filespace 20, but this filespace can't be found. As a result, expiration processing of this filespace is cancelled, and expiration starts processing the next filespace for this node. The ANR0871E message returns the node number (301) and filespace number (20) of the problematic filespace. There is no actual nodename returned, so to be sure about the exact nodename you can create a temporary table and query the hidden Node ID field. This is described in more detail in the TSMBLOG.org article: "How to expire a specific node (and only just that node) and query undocumented tables at the same time?" tsm: HANNIBAL>create sqltable Nodes MyTable SQL table MyTable created to map base table Nodes. As you can see, the NODE_ALPACA is not the node with the problem (so pay close attention to this in the actlog, before hunting down the wrong node), but the Node ID 301 matches BLACKVAN_TSMCL. A 'q filespace' for this node learns that indeed there is no filespace with a FSID of 20 anymore. This means that the filespace must have been deleted. Now search for this in the activity log to find the cause of this deletion: tsm: HANNIBAL>q ac se="delete filespace" begind=-1 ....cut.... 10/28/2008 11:09:44 ANR2017I Administrator ADMIN issued command: DELETE FILESPACE BLACKVAN_TSMCL * (SESSION: 33875) 10/28/2008 11:09:44 ANR0984I Process 818 for DELETE FILESPACE started in the BACKGROUND at 11:09:44. (SESSION: 33875, PROCESS: 818) 10/28/2008 11:09:44 ANR0800I DELETE FILESPACE * for node BLACKVAN_TSMCL started as process 818. (SESSION: 33875, PROCESS: 818) 10/28/2008 11:09:44 ANR0802I DELETE FILESPACE * (backup/archive data) for node BLACKVAN_TSMCL started. (SESSION: 33875, PROCESS: 818) 10/28/2008 11:29:30 ANR0826I DELETE FILESPACE /u01 for node BLACKVAN_TSMCL encountered a transaction failure. (SESSION: 33875, PROCESS: 818) 10/28/2008 11:29:30 ANR0827I DELETE FILESPACE * will be retried for node BLACKVAN_TSMCL. (SESSION: 33875, PROCESS: 818) 10/28/2008 11:34:37 ANR0826I DELETE FILESPACE /u01 for node BLACKVAN_TSMCL encountered a transaction failure. (SESSION: 33875, PROCESS: 818) 10/28/2008 11:34:37 ANR0827I DELETE FILESPACE * will be retried for node BLACKVAN_TSMCL. (SESSION: 33875, PROCESS: 818) 10/28/2008 11:40:19 ANR4937E RECLAIM STGPOOL: Reclamation will not run - delete filespace is currently running. (SESSION: 34012) 10/28/2008 11:40:19 ANR2753I (RESCHED_RECLAMATION):will not run - delete filespace is (SESSION: 34012) 10/28/2008 11:40:19 ANR4937E RECLAIM STGPOOL: Reclamation will not run - delete filespace is currently running. (SESSION: 34012) 10/28/2008 11:40:19 ANR2753I (RESCHED_RECLAMATION):will not run - delete filespace is (SESSION: 34012) 10/28/2008 11:40:19 ANR4937E RECLAIM STGPOOL: Reclamation will not run - delete filespace is currently running. (SESSION: 34012) 10/28/2008 11:40:19 ANR2753I (RESCHED_RECLAMATION):will not run - delete filespace is (SESSION: 34012) 10/28/2008 11:42:37 ANR0806I DELETE FILESPACE * complete for node BLACKVAN_TSMCL: 662364 objects deleted. (SESSION: 33875, PROCESS: 818) 10/28/2008 11:42:37 ANR0987I Process 818 for DELETE FILESPACE running in the BACKGROUND processed 662364 items with a completion state of SUCCESS at 11:42:37. (SESSION: 33875, PROCESS: 818) Here, several things are happening at the same time. You can see that (a rare coincidence) at the same time that the expiration for FSID 20 is processed, a delete filespace for filespace '/u01' (this must be FSID20) also is triggered. Both the expiration and delete filespace encounter errors (ANR0871E and ANR0826I - transaction failure), but both processes will skip the filespace and processing continues to eventually complete successfully. In the help text output of 'h 0826' there is stated that "the transaction failure usually indicates that a database deadlock was detected during file space deletion." And this is exactly the case here. In the help text, as an example, multiple file space deletion processes are running at the same time is mentioned. This situation is nearly the same, but with an expiration process running at the same time. A deadlock situation is a situation between a number of processes attempting to access the same database information. There is a locking mechanism within TSM to handle such situations and to prevent database corruption. Indeed a little later the deadlock situation is reported in the actlog: At the end of the day, there is no problem. The filespace was deleted, so no further expiration is required for it. These errors are a result of a coincidental situation. Note the in the actlog extract above, the reclamation process can't be started, as a result of the running delete filespace and to avoid resource contention. The case of the TSM Client lock conflict (ANR0918E)Thursday, October 30. 2008During the backup of a Windows client the ANR0918E (lock conflict) message is being reported. Below is the output from the actlog for the ANR0918E message: tsm: HANNIBAL>q ac se=anr0918 begind=-1 Date/Time Message -------------------- ---------------------------------------------------------- 10/13/2008 01:32:19 ANR0918E Inventory Query Backup for node FACE_TSMCL terminated - lock conflict. (SESSION: 16678) At the same time, in the actlog the ANR0538I (A resource waiter has been aborted) informational message is displayed. 10/13/2008 01:32:19 ANR0538I A resource waiter has been aborted. 10/13/2008 01:32:19 ANR0918E Inventory Query Backup for node FACE_TSMCL terminated - lock conflict. (SESSION: 16678) 10/13/2008 01:32:19 ANR0403I Session 16678 ended for node FACE_TSMCL (WinNT). (SESSION: 16678) To find more details, the actlog is now queried for the session number (16678) of the ANR0918E message to display the entire backup session of the node FACE_TSMCL: tsm: HANNIBAL>q ac se="session: 16678" begind=-1 Date/Time Message -------------------- ---------------------------------------------------------- 10/13/2008 00:14:11 ANR0406I Session 16678 started for node FACE_TSMCL (WinNT) (Tcp/Ip face.tsmblog.org(1659)). (SESSION: 16678) 10/13/2008 01:32:19 ANR0918E Inventory Query Backup for node FACE_TSMCL terminated - lock conflict. (SESSION: 16678) 10/13/2008 02:14:54 ANE4952I (Session: 16678, Node: FACE_TSMCL) Total number of objects inspected: 102,836(SESSION: 16678) 10/13/2008 02:14:54 ANE4954I (Session: 16678, Node: FACE_TSMCL) Total number of objects backed up: 142(SESSION: 16678) 10/13/2008 02:14:54 ANE4958I (Session: 16678, Node: FACE_TSMCL) Total number of objects updated: 0(SESSION: 16678) 10/13/2008 02:14:54 ANE4960I (Session: 16678, Node: FACE_TSMCL) Total number of objects rebound: 0(SESSION: 16678) 10/13/2008 02:14:54 ANE4957I (Session: 16678, Node: FACE_TSMCL) Total number of objects deleted: 0(SESSION: 16678) 10/13/2008 02:14:54 ANE4970I (Session: 16678, Node: FACE_TSMCL) Total number of objects expired: 16(SESSION: 16678) 10/13/2008 02:14:54 ANE4959I (Session: 16678, Node: FACE_TSMCL) Total number of objects failed: 0(SESSION: 16678) 10/13/2008 02:14:54 ANE4961I (Session: 16678, Node: FACE_TSMCL) Total number of bytes transferred: 515.75 MB(SESSION: 16678) 10/13/2008 02:14:54 ANE4963I (Session: 16678, Node: FACE_TSMCL) Data transfer time: 90.35 sec(SESSION: 16678) 10/13/2008 02:14:54 ANE4966I (Session: 16678, Node: FACE_TSMCL) Network data transfer rate: 5,845.29 KB/sec(SESSION: 16678) 10/13/2008 02:14:54 ANE4967I (Session: 16678, Node: FACE_TSMCL) Aggregate data transfer rate: 72.90 KB/sec(SESSION: 16678) 10/13/2008 02:14:54 ANE4968I (Session: 16678, Node: FACE_TSMCL) Objects compressed by: 0%(SESSION: 16678) 10/13/2008 02:14:54 ANE4964I (Session: 16678, Node: FACE_TSMCL) Elapsed processing time: 02:00:43(SESSION: 16678) 10/13/2008 02:14:54 ANR2579E Schedule DAILY_2300_0600 in domain PDM_006_WIN for node FACE_TSMCL failed (return code 12). (SESSION: 16678) 10/13/2008 02:14:54 ANR0403I Session 16678 ended for node FACE_TSMCL (WinNT). (SESSION: 16678) 10/15/2008 12:12:41 ANR2017I Administrator HUEBERT issued command: QUERY ACTLOG se="session: 16678" begind=-7 (SESSION: 8455) This information reveals that at 01:32:19 (approx. 75 minutes after the start of the backup) the ANR0918E occurs. The backup continues to process more files, but in the end will report a status of failed (ANR2579E). An error during the start of a backup window might indicate a problem during the SYSTEM STATE backup. In the local dsmerror.log for this client: 10/13/2008 01:32:19 cuGetBackQryResp: Server aborted txn, reason: 1 10/13/2008 02:14:53 ANS1802E Incremental backup of 'WMI Database' finished with 2097152 failure 10/13/2008 02:14:53 ANS1802E Incremental backup of 'WMI Database' finished with 2097152 failure 10/13/2008 02:14:54 ANS1301E Server detected system error 10/13/2008 02:14:54 ANS1512E Scheduled event 'DAILY_2300_0600' failed. Return code = 12. As you can see here, at the same time of the ANR0918 on the TSM server the "cuGetBackQryResp: Server aborted txn, reason: 1" is logged. At 02:14:53 a ANS1802E (incremental backup failure) of the 'WMI Database' is reported (this is the same time as the summary report in the actlog and in dsmsched.log). The WMI database is part of the system state. My suspicion is that the error during the backup of the WMI Database actually is the error at 01:32:19, but is not reported until 02:14:54. Just the "cuGetBackQryResp" text is not very helpful. In the end the ANS1301E (Server detected system error) message is displayed. The file dsmsched.log displays the summary and the ANS1802E/ANS1301E error as well: 10/13/2008 00:14:09 Node Name: FACE_TSMCL 10/13/2008 00:14:10 Session established with server HANNIBAL: AIX-RS/6000 10/13/2008 00:14:10 Server Version 5, Release 4, Level 2.0 10/13/2008 00:14:10 Server date/time: 10/13/2008 00:14:11 Last access: 10/12/2008 13:02:07 10/13/2008 00:14:10 --- SCHEDULEREC OBJECT BEGIN DAILY_2300_0600 10/12/2008 23:00:00 10/13/2008 02:14:53 ANS1802E Incremental backup of 'WMI Database' finished with 2097152 failure 10/13/2008 02:14:54 --- SCHEDULEREC STATUS BEGIN 10/13/2008 02:14:54 Total number of objects inspected: 102,836 10/13/2008 02:14:54 Total number of objects backed up: 142 10/13/2008 02:14:54 Total number of objects updated: 0 10/13/2008 02:14:54 Total number of objects rebound: 0 10/13/2008 02:14:54 Total number of objects deleted: 0 10/13/2008 02:14:54 Total number of objects expired: 16 10/13/2008 02:14:54 Total number of objects failed: 0 10/13/2008 02:14:54 Total number of bytes transferred: 515.75 MB 10/13/2008 02:14:54 Data transfer time: 90.35 sec 10/13/2008 02:14:54 Network data transfer rate: 5,845.29 KB/sec 10/13/2008 02:14:54 Aggregate data transfer rate: 72.90 KB/sec 10/13/2008 02:14:54 Objects compressed by: 0% 10/13/2008 02:14:54 Elapsed processing time: 02:00:43 10/13/2008 02:14:54 --- SCHEDULEREC STATUS END 10/13/2008 02:14:54 ANS1301E Server detected system error This behaviour is happening for several of our Windows clients. Analysis learned that the match between the nodes is that they are still on the Windows 2000 Server platform, and are all running TSM Client level 5.2 or 5.3. In the IBM KB I've found several articles which describe the problem: The backup of the SYSTEM OBJECT is running as a single transaction and taking long (due to the large amount of small files in the system state) and other transactions need the same information (lock) in the TSM database. This is related to the ANR0918E (lock conflict) error. It is stated that "in a multithreaded client environment other client threads for this same node may end up having their transaction time out waiting for the lock(s) held by the system object transaction." By changing the RESOURCEUTILIZATION parameter to 1, you'll get a single thread (consumer and producer session combined). Thus, multithreaded backups are disabled. Through this APAR a modification is made to the transaction processing of system objects to avoid terminating other client sessions associated with a multithreaded backup. IC36769: TSM BACKUP OF SYSTEM OBJECTS ARE LONG RUNNING AND CAN LEAD TO FAILURES OF OTHER TRANSACTIONS NEEDING COMMON LOCKS. IC42909: ITSM WINDOWS CLIENTS BACKING UP SYSTEM OBJECTS CAN SUFFER PERFORMANCE PROBLEMS THAT CAN EXTEND THE BACKUP WINDOW. IC36769 should be fixed in TSM Client 5.1 and IC42909 in 5.2. As a TSM Client at level 5.3 still has this problem, but none of the 5.4 or 5.5 anymore, it is safe to assume that this problem is fixed a while ago. As soon as we're off the legacy Windows 2000 OS the TSM client can be upgraded and the problem should be gone. For the time being the RESOURCEUTILIZATION has been changed to 1 to avoid multithreaded backups. Client backup schedule hide-and-seek? (ANR2597W)Friday, October 24. 2008The node BA_TSMCL makes it daily backups successfully. At least that is what the TSM Client reports in the extraction below from the dsmsched.log: 08/18/2008 00:38:20 --- SCHEDULEREC OBJECT BEGIN DAILY_2300_0600 08/18/2008 21:00:00 08/18/2008 02:10:18 Total number of objects inspected: 44,048 08/18/2008 02:10:18 Total number of objects backed up: 3,196 08/18/2008 02:10:18 Total number of objects updated: 0 08/18/2008 02:10:18 Total number of objects rebound: 0 08/18/2008 02:10:18 Total number of objects deleted: 0 08/18/2008 02:10:18 Total number of objects expired: 11 08/18/2008 02:10:18 Total number of objects failed: 0 08/18/2008 02:10:18 Total number of subfile objects: 0 08/18/2008 02:10:18 Total number of bytes transferred: 704.89 MB 08/18/2008 02:10:18 Data transfer time: 1,742.08 sec 08/18/2008 02:10:18 Network data transfer rate: 414.33 KB/sec 08/18/2008 02:10:18 Aggregate data transfer rate: 195.36 KB/sec 08/18/2008 02:10:18 Objects compressed by: 0% 08/18/2008 02:10:18 Subfile objects reduced by: 0% 08/18/2008 02:10:18 Elapsed processing time: 08/18/2008 02:10:18 --- SCHEDULEREC STATUS END 08/18/2008 02:10:18 --- SCHEDULEREC OBJECT END DAILY_2300_0600 08/17/2008 23:00:00 08/18/2008 02:10:18 Scheduled event 'DAILY_2300_0600' completed successfully. 08/18/2008 02:10:18 Sending results for scheduled event 'DAILY_2300_0600'. 08/18/2008 02:10:18 Results sent to server for scheduled event 'DAILY_2300_0600'. You can notice that the schedule DAILY_2300_0600 starts at 00:38:20 and finishes at 02:10:18. As the backup window is open until 06:00 it does not come even close to the end of the backup window. There is plenty of time left. After the schedule is finished, the TSM server is informed about the successfully completed backup ("Send results for scheduled event"/"Results sent to server for scheduled event"). However, despite being reported in dsmsched.log that the results were sent to the server, this same server is reporting that it never received this information. With this result in the actlog during the next backup run: 08/19/08 23:03:32 ANR2597W Node BA_TSMCL in domain PDM_WINDOWS began schedule DAILY_2300_0600 at 08/19/08 00:38:20 and is In Progress. The period expired at 08/19/08 23:00:00. The next backup on 08/19/08 will not start due to this ANR2597W message telling you that node BA_TSMCL has began its schedule the previous day at 00:38:20 and is still "In Progress". As it is now 24 hours after the previous backup window opened the ANR2597W message is issued. This is the sequence of events: 08/18/08 23:00:00 ----> Backup window opens 08/19/08 00:38:20 ----> Schedule starts and will complete successfully (locally) but keeps 'In Progress' on the server 08/19/08 23:00:00 ----> Backup windows closes (24 hours after start) and at the same time starts new window In short, from a TSM client point of view there is no problem. From the TSM server point of view there is, as the new backup run won't start because the status of the previous backup run was never reported back by the client (and is still assumed as being in progress), and as a result, now it wil not permit a new backup session. After a while the connections are reset and the next day (the 2nd backup run after the ANR2597W message) the backup is permitted to run again. It might run without problems for a few days, and after that the ANR2597W message appears again. When you look at 'h 2597' there is an explanation and there are 3 possible causes listed: Explanation: This message is displayed when a node started a schedule, the period for the schedule expired, and the node has not reported results for the schedule. Possible causes are: 1 - processing for the action is taking longer than the schedule period 2 - a problem occurred while the node was performing the scheduled action 3 - the node completed the scheduled action but did not report the results to the server As #1 and #2 are not an option, cause #3 must be the problem. When you look at the backups for this node over the past week, you can see that the status will remain 'In Progress' despite the two successfull backups afterwards: 'Q EV * * NODE=BA_TSMCL BEGIND=-7 ENDD=today': Scheduled Start Actual Start Schedule Name Node Name Status -------------------- -------------------- ------------- ------------- --------- 08/15/08 23:00:00 08/16/08 02:56:50 DAILY_2300_0600 BA_TSMCL Completed 08/16/08 23:00:00 08/17/08 00:20:20 DAILY_2300_0600 BA_TSMCL Completed 08/17/08 23:00:00 08/18/08 01:08:42 DAILY_2300_0600 BA_TSMCL Completed 08/18/08 23:00:00 08/19/08 00:38:20 DAILY_2300_0600 BA_TSMCL In Progress 08/19/08 23:00:00 08/20/08 01:13:51 DAILY_2300_0600 BA_TSMCL Completed 08/20/08 23:00:00 08/21/08 02:58:48 DAILY_2300_0600 BA_TSMCL Completed 08/21/08 23:00:00 DAILY_2300_0600 BA_TSMCL Future You might want to search for previous occurences of this problem. The problem is an old and known issue, reported in this IBM KB article. This article describes that the schedule got stalled and because of polling all subsequent schedules were missed, until a reset on the client was performed. This particular TSM Client was on TSM Client level 5.2 because the underlying OS is AIX 5.2 and can't be upgraded. This system will be legacy within a few weeks. I've seen the ANR2597W message once on a 5.3 client, but never on 5.4 or 5.5. As Google, ADSM.org and ADSM-L don't report ANR2597W in the context above, it is save to assume the problem is solved back in TSM 5.2/5.3 and is not a problem anymore in the current release(s). But for the legacy users (for whatever reason) among us of 5.2/5.3 this information can come in handy. A brief explanation of the SHAREDLIBIDLE parameterWednesday, October 15. 2008It is interesting to see how the purpose of the SHAREDLIBIDLE parameter has changed within TSM over time. Even more interesting is the fact that the SHAREDLIBIDLE server option (which need to be set in dsmserv.opt) is not mentioned in any of the administrator reference guides either for TSM 5.1, 5.2, 5.3, 5.4 nor 5.5. In TSM 5.1 the parameter was used in a shared tape environment when a library client is not mounting any tapes and the sessions are in media wait forever. The mount points are waiting on a dismount that has already occurred, but not signalled (APAR IC35283). By setting SHAREDLIBIDLE YES on the library client future hangs of this nature could be avoided. In TSM 5.2 the parameter was used when two or more servers are sharing a tape library and volumes are not left idle on library clients. After a session a volume is immediately dismounted. SHAREDLIBIDLE YES forces volumes being left idle in the drives for the use of the library clients so they can be used immediately again when the tape is needed the next time without an extra tape mount. This way excessive tape mounts/dismounts can be avoided under circumstances where you might need this. In addition to library clients, storage agents could also benefit from this. This is still the purpose of the SHAREDLIBIDLE parameter in the current release of TSM. It works in conjunction with the MOUNTRetention parameter of the sequential device class. The default of the SHAREDLIBIDLE server option is NO. In this entry on the ADSM-L mailing list someone has set the MOUNTRetention parameter of the sequential device class to a value of 9999, but volumes are still dismounted minutes after they're used. The solution is to place SHAREDLIBIDLE YES in dsmserv.opt and cycle the TSM server. You can also force a client node to keep its mount point during an entire session by using the KEEPMP parameter. The case of the failed merge test (DP for MSSQL)Thursday, October 9. 2008A MSSQL database started to fail to backup with DP for MSSQL; now after a few days it has become a consisted problem. The database name is "datastore". Below is an overview of the last backup run from tdpsql.log: 09/29/2008 20:53:51 Request : FULL BACKUP 09/29/2008 20:53:51 Database Input List : datastore 09/29/2008 20:53:51 Group Input List : - 09/29/2008 20:53:51 File Input List : - 09/29/2008 20:53:51 Number of Buffers : 3 09/29/2008 20:53:51 Buffer Size : 1024 09/29/2008 20:53:51 Number of SQL Buffers : 0 09/29/2008 20:53:51 SQL Buffer Size : 1024 09/29/2008 20:53:51 Number of Stripes specified : 1 09/29/2008 20:53:51 Estimate : - 09/29/2008 20:53:51 Truncate Log? : - 09/29/2008 20:53:51 Wait for Tape Mounts? : Yes 09/29/2008 20:53:51 TSM Options File : c:\Progra~1\Tivoli\tsm\TDPSql\dsm.opt 09/29/2008 20:53:51 TSM Nodename Override : - 09/29/2008 20:53:51 Sqlserver : SQL00001 09/29/2008 20:53:51 09/29/2008 20:54:02 ACO5436E A failure occurred on stripe number (0), rc = 1301 09/29/2008 20:54:02 ANS0328E (RC45) The specified objects failed the merge test. 09/29/2008 20:54:20 ACO5436E A failure occurred on stripe number (0), rc = 1301 09/29/2008 20:54:20 ANS0328E (RC45) The specified objects failed the merge test. 09/29/2008 20:54:20 Backup of datastore failed. 09/29/2008 20:54:20 ANS0328E (RC45) The specified objects failed the merge test. 09/29/2008 20:54:20 Total SQL backups selected: 1 09/29/2008 20:54:20 Total SQL backups attempted: 1 09/29/2008 20:54:20 Total SQL backups completed: 0 09/29/2008 20:54:20 Total SQL backups excluded: 0 09/29/2008 20:54:20 Total SQL backups inactivated: 0 09/29/2008 20:54:20 Throughput rate: 110.72 Kb/Sec 09/29/2008 20:54:20 Total bytes transferred: 3,230,688 09/29/2008 20:54:20 Elapsed processing time: 28.50 Secs 09/29/2008 20:54:20 ACO0151E Errors occurred while processing the request. 09/29/2008 20:54:20 ACO0151E Errors occurred while processing the request. On overview of all the errors can be seen in dsierror.log: 09/15/2008 20:49:59 ANS0328E The specified objects failed the merge test. 09/15/2008 20:50:00 ANS0328E The specified objects failed the merge test. 09/15/2008 21:03:23 ANS0328E The specified objects failed the merge test. 09/15/2008 21:03:24 ANS0328E The specified objects failed the merge test. 09/15/2008 21:03:26 ANS0328E The specified objects failed the merge test. 09/15/2008 21:03:27 ANS0328E The specified objects failed the merge test. ...cutout 09/28/2008 22:50:43 ANS0328E The specified objects failed the merge test. 09/29/2008 05:50:41 ANS0328E The specified objects failed the merge test. 09/29/2008 05:50:42 ANS0328E The specified objects failed the merge test. 09/29/2008 05:50:44 ANS0328E The specified objects failed the merge test. 09/29/2008 05:50:45 ANS0328E The specified objects failed the merge test. 09/29/2008 20:54:02 ANS0328E The specified objects failed the merge test. 09/29/2008 20:54:20 ANS0328E The specified objects failed the merge test. As you can see, the ANS0328E problem has been around since 09/15/2008. The "datastore" DB is the only database on this SQL server with problems. Any other DB on this server will backup without any problems. The "ACO5436E A failure occurred on stripe number (0), rc = 1301" is not relevant in this case, it is added as some sort of bonus to confuse us TSM administrators even more. Below are the include statements from c:\Progra~1\Tivoli\tsm\TDPSql\dsm.opt: Include "\...\full*" MC_SQL Include "\...\difffull*" MC_SQL Include "\...\log*" MC_SQL_LOG INCLUDE "\...\meta\...\*" MC_SQL_META This particular setup is chosen so that a full or difffull SQL DB backup will be send to the copy destination of the management class MC_SQL. SQL logfiles will be send to another diskpool than the (diff)full backups and SQL metadata will be send to another diskpool as well. When you want to spread out different types of backup data to different storage pools, you'll need these include statements to setup another MC than the default of the policy domain to which the DP for MSSQL node belongs. The problem is that the MC's as listed in dsm.opt have been defined recently, and since the include statements were changed (so no longer the DEFAULT MC is used, but this new ones) a backup never succeeded. The ANS0328E message is encountered ever since. When you search on this problem in the IBM KB it turns out to be a very known phenomenon (first reported in 2001/2002), and is even treated by IBM on a "by design" basis. The APAR's dealing with this problem are even closed as 'permanents restriction'. So the problem is not considered a "problem". It is TSM working as designed. The ANR0328E comes up when the DP for MSSQL want to backup something that it has already backed up previously with a different MC (or: the MC's has been changed between two backups). When a MC's is changed (through include statements or policy domain changes) the problem will arise. But this will only occur if the meta data and bitfile data objects have different copy destinations, which is the case. Generally, you'll want the meta data objects on a random disk device (to prevent slow tape mounts during a restore when the first thing you'll most likely need is a directory listing) and the bitfile objects on tape (or disk as well in this case). The fix from IBM is really simple; "Do not change the management class that TDP for MS SQL backups are bound to if multiple copy destinations are used. If a management class change is necessary, create a new nodename and perform a full backup with the new management class settings using that nodename for all future backups." Fantastic. You'll need to retain the "old" nodename and it's stored objects until they're expired and you can than delete the filespaces and remove the node from TSM. Another workaround might be to delete or rename existing filespaces to continue without problems with the same nodename. When you delete the filespaces your backup history is gone and the backup will work again as there is no previous data with a different MC. For a rename the effect is all the same, but you'll get to keep your backup history. Think about how important the history is (does your DBA really need it?) and maybe it is best to just delete all the filespaces and start fresh with the new MC's from scratch. From the IBM KB: "This problem is caused by the fact that TDP for SQL needs to maintain logical integrity of multiple TSM objects within a single transaction. The problem is compounded by the fact that you can not have backup objects within a single transaction span different copy destinations. TDP for SQL uses temporary objects across multiple transactions for the initial backup. It then creates a single transaction to rename all of those objects to their proper TSM object name. The problem comes in when the management class has changed from the previous backup, and the new objects have different copy destinations". But what if the copy destination is the same for all the new MC's, or the MC is just renamed without a change to the original copy destination? Will that work? This situation is not addressed in the IBM KB article. However, the problem conclusion is that this limitation in the "current" design will need to be changed in order to address this unique situation. As this document was last updated 2002-06-28, the limitation was never resolved, and is now considered as TSM working by design. The KB documents refer to a change of MC's as a specific, not general ("unique") situation, which of course it isn't. Besides that: you should be able to change your MC parameters without the loss of previous data. With this query we searched for all the backups from the database "datastore" for the node SQL00001_TSMCL-SQL: SELECT * FROM BACKUPS WHERE NODE_NAME='SQL00001_TSMCL-SQL' and and HL_NAME='\datastore\' NODE_NAME FILESPACE_NAME FILESPACE_ID STATE TYPE HL_NAME LL_NAME OBJECT_ID BACKUP_DATE DEACTIVATE_DATE OWNER CLASS_NAME SQL00001_TSMCL-SQL SQL00001\data\0001 2 ACTIVE_VERSION FILE \datastore\ full 42592448 2008-09-15 01:01:31.000000 DEFAULT SQL00001_TSMCL-SQL SQL00001\data\0001 2 INACTIVE_VERSION FILE \datastore\ full 36420386 2008-07-22 20:45:21.000000 2008-07-23 20:47:25.000000 DEFAULT SQL00001_TSMCL-SQL SQL00001\data\0001 2 INACTIVE_VERSION FILE \datastore\ full 36510394 2008-07-23 20:47:25.000000 2008-07-24 20:47:36.000000 DEFAULT ....cutout SQL00001_TSMCL-SQL SQL00001\meta\0000 1 INACTIVE_VERSION FILE \datastore\ full 42269940 2008-09-12 20:49:42.000000 2008-09-13 20:52:37.000000 DEFAULT SQL00001_TSMCL-SQL SQL00001\meta\0000 1 INACTIVE_VERSION FILE \datastore\ full 42340563 2008-09-13 20:52:37.000000 2008-09-15 01:01:32.000000 DEFAULT As you can see, the backups "stopped" at 2008-09-15, with a time stamp just before the first ANS0328E error in dsierror.log. The field CLASS_NAME contains the value of DEFAULT, which is different from the new values with the name MC_SQL*. How can another SQL db (by the name of taxlisting) for this node backup without problems? The same query, but now for the database 'taxlisting': SELECT * FROM BACKUPS WHERE NODE_NAME='SQL00001_TSMCL-SQL' and HL_NAME='\taxlisting\' NODE_NAME FILESPACE_NAME FILESPACE_ID STATE TYPE HL_NAME LL_NAME OBJECT_ID BACKUP_DATE DEACTIVATE_DATE OWNER CLASS_NAME SQL00001_TSMCL-SQL SQL00001\data\0001 2 ACTIVE_VERSION FILE \taxlisting\ difffull 45482419 2008-10-01 21:12:53.000000 MC_SQL SQL00001_TSMCL-SQL SQL00001\data\0001 2 ACTIVE_VERSION FILE \taxlisting\ full 45297559 2008-09-30 17:43:07.000000 MC_SQL SQL00001_TSMCL-SQL SQL00001\data\0001 2 INACTIVE_VERSION FILE \taxlisting\ difffull 45299574 2008-09-30 21:13:11.000000 2008-10-01 21:12:53.000000 MC_SQL SQL00001_TSMCL-SQL SQL00001\meta\0000 1 ACTIVE_VERSION FILE \taxlisting\ difffull 45482425 2008-10-01 21:12:56.000000 MC_SQL_META SQL00001_TSMCL-SQL SQL00001\meta\0000 1 ACTIVE_VERSION FILE \taxlisting\ full 45297560 2008-09-30 17:43:16.000000 MC_SQL_META SQL00001_TSMCL-SQL SQL00001\meta\0000 1 INACTIVE_VERSION FILE \taxlisting\ difffull 45299575 2008-09-30 21:13:12.000000 2008-10-01 21:12:56.000000 MC_SQL_META You can see that the first backup for this database is from 2008-09-30 with the correct CLASS_NAME value. This database was first backed up AFTER the MC changes were made; so there was no previous backup data with the old MC name (DEFAULT) to interfere. The filespaces for the database "datastore" were deleted and after that the DP for SQL backup completed successfully right away. From the TSM server actlog: 10/02/2008 07:56:26 ANE4991I (Session: 358608, Node: SQL00001_TSMCL-SQL) TDP MSSQL Win32 ACO3001 Data Protection for SQL: full backup of database datastore from server SQL00001 completed successfully.(SESSION: 358608) 10/02/2008 07:56:26 ANE4991I (Session: 358608, Node: SQL00001_TSMCL-SQL) TDP MSSQL Win32 ACO3008 Data Protection for SQL: Backup of server SQL00001 is complete. Total SQL backups selected: 1 Total SQL backups attempted: 1 Total SQL backups completed: 1 Total SQL backups excluded: 0 Total SQL backups inactivated: 0 Throughput rate: 8,186.79 Kb/Sec Total bytes transferred: 17,177,328 Elapsed processing time: 2.05 Secs(SESSION: 358608) Below are 3 related TSM KB articles: http://www-01.ibm.com/support/docview.wss?uid=swg1IC33026 http://www-01.ibm.com/support/docview.wss?uid=swg1IC31372 http://www-01.ibm.com/support/docview.wss?uid=swg1IC31617 The related TSMBLOG.org KB article can be found here.
(Page 1 of 18, totaling 107 entries)
» next page
Competition entry by David Cummins powered by Serendipity v1.0 |
TSMBLOG.org![]() QuicksearchTSM around the globeLinksSyndicate This BlogStatisticsLast entry: 2008-11-27 10:37
107 entries written
|
