HLRN System - News Archive 2009


ID      Published     Subject
[1588]  Dec 25, 2009  ICE2 Core Numbering
[1587]  Dec 18, 2009  Newsletter: Eighth edition of HLRN Informationen published
[1585]  Dec 17, 2009  New Policies for Data Handling and Archiving
[1583]  Dec 14, 2009  HLRN site Berlin ICE network problems
[1579]  Dec 11, 2009  ICE2 memory upgrade, nodes only partially available
[1576]  Dec 03, 2009  HLRN Maintenance Berlin finished
[1574]  Dec 02, 2009  HLRN Maintenance Hannover finished
[1567]  Nov 24, 2009  HLRN Hannover hiccup
[1565]  Nov 23, 2009  HLRN Maintenance Hannover 01.12.2009, Berlin 02./03.12.2009
[1563]  Nov 18, 2009  HLRN-II improves position among the world's fastest computers
[1560]  Nov 16, 2009  MVAPICH2 1.4.0 available
[1558]  Nov 10, 2009  HLRN Hannover - Lustre Problem solved
[1557]  Nov 10, 2009  HLRN Hannover - Lustre Problems
[1554]  Nov 04, 2009  HLRN User Workshop "Intel Nehalem Update"
[1551]  Oct 22, 2009  Solved: HLRN Berlin testq on ice2 not available
[1547]  Oct 19, 2009  Solved: HLRN site Berlin closed Oct. 19 noon
[1543]  Oct 16, 2009  HLRN Hannover with InfiniBand problems - solved
[1539]  Oct 16, 2009  Change of default settings for MVAPICH2 and SGI-MPT
[1537]  Oct 16, 2009  VASP 5.2.2 available at HLRN
[1535]  Oct 15, 2009  New versions of CAE programs ANSYS, CFX, Fluent, Star-CD
[1533]  Oct 13, 2009  Maintenance finished in Hannover
[1522]  Oct 08, 2009  HLRN problem report form ready for use
[1526]  Oct 07, 2009  Maintenance finished in Hannover
[1525]  Oct 07, 2009  Maintenance at HLRN Complex in Hannover
[1520]  Oct 07, 2009  Software Maintenance finished in Berlin Wed Oct 07
[1518]  Oct 01, 2009  Software Maintenance starting Thu Oct. 01, 2009 on both HLRN sites
[1516]  Sep 25, 2009  HLRN data handling and data security
[1514]  Sep 25, 2009  HLRN Hannover up again
[1505]  Sep 24, 2009  HLRN downtime Hannover - good news
[1501]  Sep 23, 2009  Production stop in Hannover
[1498]  Sep 14, 2009  Overdrawing of NPL quotas extended
[1495]  Sep 14, 2009  Limits for batch job class testq increased
[1494]  Sep 14, 2009  OpenFOAM Version 1.6 is installed
[1492]  Sep 10, 2009  PERM Maintenance on BICE
[1490]  Sep 07, 2009  Berlin open for all users (bigq)
[1486]  Sep 03, 2009  Maintenance finished in Berlin
[1483]  Sep 03, 2009  Maintenance finished in Hannover, Berlin delayed
[1481]  Aug 31, 2009  General Availability of the ICE2 System
[1475]  Aug 27, 2009  HLRN Downtime in Hannover Sept 1, 2009
[1474]  Aug 26, 2009  Downtime and Maintenance Day in Berlin at ZIB, September 1, 2009
[1469]  Aug 19, 2009  TUHH Workshop "Parallelprogrammierung" in Hamburg
[1470]  Aug 18, 2009  HLRN second installation phase - acceptance period
[1466]  Aug 14, 2009  HLRN data servers available in batch jobs
[1462]  Aug 13, 2009  HLRN Hannover downtime finished
[1460]  Aug 13, 2009  HLRN site Berlin downtime finished
[1458]  Aug 12, 2009  Extended Downtime of HLRN sites Berlin and Hannover
[1456]  Aug 11, 2009  Complex Berlin down
[1454]  Aug 10, 2009  Downtime on Berlin complex
[1449]  Aug 10, 2009  HLRN Hannover downtime - update
[1446]  Aug 07, 2009  HLRN Berlin downtime finished
[1444]  Aug 07, 2009  Hannover downtime finished
[1442]  Aug 06, 2009  HLRN Hannover Downtime
[1441]  Aug 05, 2009  HLRN second installation phase - Downtime Berlin complex
[1437]  Jul 30, 2009  HLRN second installation phase - some jobs lost in Hannover
[1436]  Jul 29, 2009  $HOME lost at HLRN site Berlin
[1433]  Jul 29, 2009  HLRN VI-HPS Tuning Workshop 2009
[1431]  Jul 24, 2009  HLRN Hannover up again - addendum
[1427]  Jul 24, 2009  HLRN Hannover up again
[1424]  Jul 23, 2009  HLRN second installation phase - Downtime Hannover complex extended
[1420]  Jul 22, 2009  Changes of Batch System Configuration
[1419]  Jul 17, 2009  Downtime of Hannover complex and backup
[1416]  Jul 15, 2009  HLRN second installation phase - Downtime Hannover complex
[1413]  Jul 14, 2009  hicegate1 available again
[1410]  Jul 14, 2009  BICE is down for emergency file system check
[1406]  Jul 13, 2009  Maintenance of tape library in Hannover
[1401]  Jul 04, 2009  HLRN Hannover available again (was: Queues closed in Hannover)
[1391]  Jul 01, 2009  Personal Allocations increased to 1000 NPL per Quarter
[1380]  Jun 30, 2009  Project applications - next deadline July 28, 2009
[1393]  Jun 30, 2009  Queues closed in Hannover for emergency maintenance (was: hg2 crashed in Hannover)
[1389]  Jun 23, 2009  Batch system maintenance Jun 18 (Hannover), Jun 23 (Berlin)
[1388]  Jun 23, 2009  Newsletter: Seventh edition of HLRN Informationen published
[1386]  Jun 18, 2009  Maintenance of $PERM in Berlin
[1381]  Jun 16, 2009  NAMD 2.7b1 available at HLRN
[1376]  Jun 08, 2009  NFS server crashed in Hannover
[1375]  Jun 05, 2009  $HOME server crashed in Berlin 05.06.2009
[1370]  Jun 03, 2009  ZIB: Job opening in HPC
[1369]  Jun 02, 2009  Interactive CPU time limit
[1366]  May 29, 2009  Network infrastructure maintenance at RRZN starting Jun 03, 18:00
[1363]  May 27, 2009  hicegate1 available again - tips for using the gateways
[1357]  May 25, 2009  Ferret v6.2 available
[1354]  May 25, 2009  Home file system crashed in Hannover
[1352]  May 15, 2009  New Batch job class for serial jobs
[1343]  May 13, 2009  BICE Maintenance finished
[1341]  May 12, 2009  Comprehensive documentation of MPI libs available
[1338]  Apr 27, 2009  HLRN at the ISC09
[1331]  Apr 23, 2009  Hannover complex back to production status
[1324]  Apr 23, 2009  Information on the availability of the Hannover complex / Lustre Problem
[1323]  Apr 20, 2009  Hannover complex not available
[1320]  Apr 16, 2009  Hannover complex not available
[1316]  Apr 08, 2009  Update: Hannover not available (was: Emergency drain of complex Hannover)
[1311]  Apr 02, 2009  GROMACS available at HLRN
[1308]  Apr 01, 2009  HLRN Maintenance Wed April 1, 2009 finished
[1305]  Mar 25, 2009  HLRN Maintenance Wed April 1, 2009
[1302]  Mar 09, 2009  HOME File System - Software Maintenance on Tuesday, March 10
[1301]  Mar 05, 2009  Many-Core and Reconfigurable Supercomputing Conference
[1299]  Mar 04, 2009  Webserver Certificate Change on Thursday, March 05 2009
[1297]  Mar 02, 2009  GAMESS_US version January 12, 2009 R1 available at HLRN
[1293]  Feb 19, 2009  Berlin XE not accessible Thu Feb 19
[1291]  Feb 12, 2009  HLRN Maintenance finished
[1285]  Feb 05, 2009  HLRN Maintenance from Monday February 09 until Thursday February 12, 2009
[1283]  Feb 03, 2009  No login to hd4 because of system upgrade
[1281]  Feb 02, 2009  Solved: Hardware problems with tape robot system in Hannover
[1279]  Jan 22, 2009  iscsi swap space deactivated on ICE Hannover
[1275]  Jan 22, 2009  Solved: Network problems between Complexes
[1273]  Jan 19, 2009  mpiexec available for SGI MPT
[1267]  Jan 13, 2009  Solved: Home directories not available in Hannover
[1263]  Jan 06, 2009  hicedata1/hd1 available again
[1260]  Jan 05, 2009  Resolved: no login possible to hd1/hicedata1



ICE2 Core Numbering
[1588] Dec 25, 2009

Please note the updated documentation regarding changed core numbering on ICE2 nodes:

Understanding the HLRN CPUs
MPI Implementations at HLRN
Example Program to Bind Processes to CPUs

/Bka


Newsletter: Eighth edition of HLRN Informationen published
[1587] Dec 18, 2009
The eighth edition of the HLRN newsletter HLRN Informationen is available for download at
https://www.hlrn.de/home/view/NewsCenter/NewsLetter.
This web page also contains instructions for (un-)subscribing to the newsletter mailing list.

From the content:
  • News about the HLRN-II system
  • Project report on "Hadron Physics from Lattice QCD"
  • Parallel programming and Optimization workshops in January/February 2010
(wwb)



New Policies for Data Handling and Archiving
[1585] Dec 17, 2009
Although the HLRN-II system provides much more data storage space than the HLRN-I system, the file systems are already beginning to fill up.

On Monday, January 11, 2010, HLRN will begin to put a set of new policies and concepts for data handling and archiving into effect step by step. Long-time users of HLRN should already be familiar with these measures. The whole set includes important changes for the file systems $HOME, $WORK, and $PERM that will be introduced over time:
  • $WORK: Files on the global file system $WORK will have a limited lifetime. This will be introduced on January 11, 2010, see also below.
  • $HOME: File space and inodes in $HOME will be limited by quotas.
    A defined part of each user's $HOME will not be replicated to the other complex. This will be introduced in the near future.
  • $PERM: The archive file system $PERM will be accessible on a new DATA node only. $PERM will no longer be available on the data nodes bicedata1-4 and hicedata1-4. This will be introduced in the near future.
As a first measure, on Monday January 11, 2010 a limited lifetime is introduced for files on $WORK: files that have not been accessed for 100 days will be deleted. Affected users will be informed by email one week before the file deletion is performed: the first email will be sent on January 11, affected files will be deleted on January 18.
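
To get a rough idea which of your files on $WORK would currently be affected, you can run a simple check on a login or data node. This is only a sketch (it assumes GNU find and uses the 100-day threshold from the policy above; the directory name in the copy example is a placeholder):

# List files below $WORK that have not been accessed for more than 100 days
# (candidates for deletion under the new lifetime policy):
find $WORK -type f -atime +100 -ls

# If needed, copy results you still want to keep to the archive file system:
# cp -r $WORK/important_results $PERM/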

For all the details see the HLRN online documentation on "Data Handling" which is regularly updated to reflect the current status.

All users have also been notified by Email.

(wwb)



HLRN site Berlin ICE network problems
[1583] Dec 14, 2009

Since Mon Dec 14, about 11:00, HLRN site Berlin has been experiencing network problems affecting the ICE clusters.

All batch jobs running on ICE nodes have been aborted.
Interactive work on login and data servers is disturbed as well.
Processing of batch jobs has been stopped.

A note will be posted when the problems are solved.

We apologize for the inconvenience.

/Bka


ICE2 memory upgrade, nodes only partially available
[1579] Dec 11, 2009

Starting on Monday, December 14, HLRN will perform hardware maintenance on the ICE2 clusters in Berlin and Hannover. During the maintenance the memory of all "medium memory" nodes (feature mmem) will be upgraded from 24 GB to 48 GB for a small number of nodes at a time. Therefore the number of ICE2 nodes available for user jobs will be temporarily reduced.

After the maintenance all ICE2 nodes will have 48 GB of memory (i.e. 6 GB per core) and the total memory size of each of the two ICE2 clusters will have increased from 30 TB to 46 TB. This enables the solution of even larger problems on the ICE2 in the near future.

Please note that after completion of the memory upgrade in January
  • the batch system feature mmem will be removed
  • the online documentation will be updated to reflect the final ICE2 configuration
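
For reference, the mmem feature has so far been requested in job scripts in the usual way; once all ICE2 nodes have 48 GB, this request can simply be dropped. A minimal job header sketch (node count and walltime are placeholders):

#PBS -l nodes=4:ppn=8
#PBS -l walltime=12:00:00
#PBS -l feature=mmem
# "mmem" currently selects the medium-memory ICE2 nodes; after the upgrade
# all ICE2 nodes will have 48 GB and this feature will be removed.
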
We will post a note when the upgrade has finished.

(wwb)



HLRN Maintenance Berlin finished
[1576] Dec 03, 2009

Since Thu Dec 3, 20:40, HLRN site Berlin has been available for users again. Batch system work has been resumed.

Please note the updated documentation regarding changed core numbering on ICE2 nodes:

Understanding the HLRN CPUs
MPI Implementations at HLRN
Example Program to Bind Processes to CPUs

/Bka


HLRN Maintenance Hannover finished
[1574] Dec 02, 2009

Since Wed Dec 2, 19:00, HLRN site Hannover has been available for users again. Batch system work has been resumed.

Maintenance of HLRN site Berlin is still in progress.

We are sorry for the delay.

/Bka


HLRN Hannover hiccup
[1567] Nov 24, 2009

Login had to be closed temporarily in Hannover because of two failed Lustre servers (gfs1 and gfs2 file systems) at 14:20 and 14:40. In addition, at about 15:40 we had a power failure. It seems nothing was lost and everything is back to normal now. Login is open again.

We are sorry for the inconvenience,

Gabriel Gaus



HLRN Maintenance Hannover 01.12.2009, Berlin 02./03.12.2009
[1565] Nov 23, 2009

System maintenance has been scheduled for

HLRN site Hannover: Tue 01.12.2009 08:00 - evening
HLRN site Berlin: Wed 02.12.2009 07:00 - Thu 03.12.2009 evening

Processing of batch jobs will be stopped. Interactive work will not be possible. A note will be posted when the systems are available to the public again.

All ICE2 compute nodes with Intel Xeon Gainestown processors will undergo a BIOS upgrade. The upgrade implies a change of the core numbering scheme from legacy SGI mode (even cores on 1st socket, odd cores on 2nd socket) to common mode (cores 0-3 on 1st socket, cores 4-7 on 2nd socket). The numbering on ICE1, XE, login and data nodes with Intel Xeon Harpertown processors will stay at legacy mode.

Thus, the preferred map to achieve CPU and memory affinity on ICE2 nodes changes. The HLRN module files for SGI-MPT (v1.22 and newer) and MVAPICH2 (prior to v1.4.0) as well as the MPI library MVAPICH2 v1.4.0 handle the change and the resulting inhomogeneity of HLRN compute nodes transparently for the user. The HLRN documentation at

Hardware Overview
MPI implementations at HLRN
SGI MPT
MVAPICH2

will reflect the change in the near future. The SGI-MPT and MVAPICH2 module files set the following default maps:

Harpertown:      0,1,4,5,2,3,6,7
Gainestown old:  0,1,4,5,2,3,6,7 or 0,1,2,3,4,5,6,7
Gainestown new:  0,4,1,5,2,6,3,7

Note that the MPI module files have to be loaded in a batch job in order to detect the CPU topology of the running machine correctly. Loading via login scripts will not have the desired effect.

Users who implemented their own pinning schemes in their job scripts or executables have to adapt their software to handle the new core numbering on ICE2 nodes as well as the old numbering on ICE1 and XE nodes. Otherwise performance will degrade.

See also Example Program to Bind Processes to CPUs for an example on how to analyze CPU topologies and how to achieve CPU and memory affinity in own codes.
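
As an illustration, a batch job that relies on the module-provided defaults might look like the following sketch (node count, walltime, and the program name my_mpi_program are placeholders):

#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l feature=ice2
#PBS -l walltime=00:30:00
# Load the MPI module inside the batch job, so that the module file can
# detect the CPU topology of the node the job actually runs on.
module load mvapich2
# Optional check: "physical id" in /proc/cpuinfo shows the socket of each
# logical CPU, i.e. the core numbering scheme of this node.
grep "physical id" /proc/cpuinfo
# Start the MPI program; the default mapping set by the module file
# provides CPU and memory affinity.
mpiexec -np 16 ./my_mpi_program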

/Bka


HLRN-II improves position among the world's fastest computers
[1563] Nov 18, 2009
The HLRN-II systems of the North-German Supercomputing Alliance (HLRN) have improved their positions in the Top500 list of the world's fastest supercomputers. The ICE systems in Berlin and Hannover moved up by four positions to No. 39 and 40 worldwide, and by one position to No. 4 and 5 in Germany.

The ICE systems recorded a performance of 107.1 TFlop/s for the Linpack benchmark, achieving a very efficient 89% of peak performance. At each site in Berlin and Hannover both ICE clusters, ICE-1 and ICE-2, were measured as one combined system for the November 2009 edition of the Top500 list. For the previous list in June 2009 only the ICE-2 clusters were benchmarked.

(ssk/wwb)


MVAPICH2 1.4.0 available
[1560] Nov 16, 2009

Version 1.4.0 of the MPI implementation MVAPICH2 is available at HLRN.

New features that are of interest for HLRN users are:
  • support for dynamic process management according to the MPI-2 standard
  • support for efficient CPU binding on heterogeneous multi-core clusters
  • eXtended Reliable Connection (XRC) support
See MVAPICH2 1.4 Features for a more complete listing.

The HLRN default will be switched to this version on Mon, Nov. 23, 2009. Switching will be done by modifying the default mvapich2 module file to point to mvapich2/1.4.0-intel. Executables that were linked with MVAPICH2 shared libraries will use this version from then on.

The updated HLRN MVAPICH2 documentation already reflects this change. The MPI starter mpiexec used at HLRN was already updated to support MPI_Comm_spawn and related functions for MVAPICH2 1.4; see its online man page.

The current MVAPICH2 default version is mvapich2/1.2.0-intel. Users who wish to continue using this version instead of the new default have to load this module file explicitly in the future.
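
For example, to keep using the previous release after the switch, load the versioned module file explicitly (a minimal sketch):

# Keep using the previous MVAPICH2 release instead of the new default:
module load mvapich2/1.2.0-intel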

/Bka


HLRN Hannover - Lustre Problem solved
[1558] Nov 10, 2009

HLRN Hannover is available again.

Harald Schwier


HLRN Hannover - Lustre Problems
[1557] Nov 10, 2009

Login is closed due to technical problems.

We apologize for the inconvenience.

Harald Schwier


HLRN User Workshop "Intel Nehalem Update"
[1554] Nov 04, 2009
Note: the workshop language will be German.

Together with Intel, HLRN is organizing a one-day "Nehalem Update" workshop. The main topic of the workshop is the differences, from a programmer's and user's point of view, between the processors of the "Harpertown" architecture in the HLRN ICE-1 cluster and the "Nehalem" architecture in the ICE-2 cluster.

The free workshop is aimed at the HLRN consultants as well as at interested, experienced HLRN users who write, optimize, or port larger programs themselves and who can pass on the knowledge gained in the workshop within their working groups.

Date: Monday, November 23, 2009, 9:30 to 15:00
Venue: Berlin, Konrad-Zuse-Zentrum (ZIB)

Further details and the agenda can be found on the workshop page.

(wwb)


Solved: HLRN Berlin testq on ice2 not available
[1551] Oct 22, 2009

Since Mon Oct. 26, 14:20, testq on Berlin ICE2 nodes has been available again.


The original announcement was:

Due to network problems on Berlin ICE2 nodes that serve testq, batch jobs submitted at HLRN site Berlin into testq with -l feature=ice2 do not get scheduled to run.

Meanwhile, please use testq on ice1 or xe nodes.

A message will be posted when testq on Berlin ICE2 nodes is available again.

/Bka


Solved: HLRN site Berlin closed Oct. 19 noon
[1547] Oct 19, 2009

Since Tue Oct 20, 18:30, HLRN site Berlin has been fully available to users again. Processing of batch jobs has been resumed.

The original announcement was:

Since Mon Oct 19, 12:45, HLRN site Berlin has experienced
network-related problems, coupled with failures to access file systems.

A number of batch jobs crashed. Interactive sessions hung.

Batch system work was stopped at 13:15. Login was disabled at 14:30.

We apologize for the inconvenience.

/Bka



HLRN Hannover with InfiniBand problems - solved
[1543] Oct 16, 2009

Today, starting at 07:00, we had MPI communication problems between a few blades in Hannover. Later on the problems gradually got worse, and we finally had to restart the InfiniBand fabric and other components.

Unfortunately, some jobs on ICE1 ran into their wallclock limit during the early evening because the network problems eventually affected access to file systems as well.

We hope everything is back up correctly since 22:00 and apologize for the inconvenience. Please resubmit jobs that did not finish correctly.

Gabriel Gaus



Change of default settings for MVAPICH2 and SGI-MPT
[1539] Oct 16, 2009

On Wed, Oct. 21, 2009, the following default environment settings will be applied for executables using the MPI libraries MVAPICH2 and SGI-MPT:

MVAPICH2: MV2_CPU_MAPPING=0:1:4:5:2:3:6:7
SGI-MPT: MPI_DSM_CPULIST=0,1,4,5,2,3,6,7:allhosts

These settings will be applied via the corresponding modulefiles for all installed library versions.

These settings handle specific processor-architecture related performance issues transparently to HLRN users. See Understanding the HLRN CPUs for background information.

It is expected that the majority of applications at HLRN will benefit from these settings. However, there are a few cases where performance degradation will occur. These are:
  • hybrid MPI+OpenMP runs with more than one thread per task
  • runs of MPI executables with more than one job per node (potentially applies to runs on XE nodes with naccesspolicy=singleuser and less than 8 tasks per node)

If your run scenario is among these cases, or if you are in doubt, you may revert to the library defaults by unsetting the changed environment variables.
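
For instance, a job script can fall back to the library defaults like this (a minimal sketch; place the lines after loading the MPI module, since the module file sets the variables):

# Revert to the library defaults by removing the HLRN pinning settings:
unset MV2_CPU_MAPPING     # MVAPICH2
unset MPI_DSM_CPULIST     # SGI-MPT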

Note that there are applications at HLRN which override these environment settings with their own pinning strategies. Consult the documentation of these packages for information, or contact the responsible HLRN consultant if you have questions.

/Bka



VASP 5.2.2 available at HLRN
[1537] Oct 16, 2009

HLRN has obtained a license for VASP 5.2, which allows us to generate VASP 5.2 binaries and provide access to them for HLRN users who own a VASP 5.2 license themselves.

Access is completely decoupled from access to the VASP 4.6 binaries installed at HLRN. If you own a VASP 5.2 license and want access to the binaries we provide, request it via email to Bernhard Bandow, bandow_at_rrzn.uni-hannover.de.

Usage terms and details are found in the updated VASP documentation.

Registered VASP users were informed by a separate email.

/Bka


New versions of CAE programs ANSYS, CFX, Fluent, Star-CD
[1535] Oct 15, 2009

ANSYS 12.0
Due to changes in the license code ANSYS 11.0 no longer works in parallel mode. Use ANSYS 12.0 instead. This is the default now (module load ansys). Usage of version 12.0 should be upward compatible. According to ANSYS, the parallel performance of version 12.0 is better than that of version 11.0. You have to perform a one-time action before using ANSYS 12.0 for the first time. See our ANSYS web page for details.

ANSYS CFX 12.0
Due to changes in the license code ANSYS CFX 11.0 no longer works in parallel mode. Use ANSYS CFX 12.0 instead. This is the default now (module load cfx). Usage of version 12.0 should be upward compatible. You have to perform a one-time action before using ANSYS CFX 12.0 for the first time. See our ANSYS CFX web page for details.

ANSYS Fluent 12.0
The new Fluent version 12.0 is now an ANSYS product. Use "module load fluent/12.0" for this version. Usage of version 12.0 should be upward compatible with version 6.3.26. You have to perform a one-time action before using ANSYS Fluent 12.0 for the first time. See our FLUENT web page for details.
ANSYS Fluent 12.0 no longer comes with Gambit and TGrid. If you need these products, you have to use Fluent 6, which is still the default.

Star-CD 4.10
User coding with Star-CD 4.10 can now use two compilers: "module load starcd/4.10-pgi" selects the PGI compiler and "module load starcd/4.10-intel" selects the Intel compiler. The latter is the first version of Star-CD which uses the Intel compiler.

(gb)


Maintenance finished in Hannover
[1533] Oct 13, 2009

HLRN Hannover is available again.
Thank you for your patience.

Gabriel Gaus




HLRN problem report form ready for use
[1522] Oct 08, 2009

You want to report a problem to HLRN? Please use our new problem report form.

If you report a problem to HLRN you should be prepared to answer a number of questions and to have certain information ready (see also the information on HLRN support). By providing complete and detailed information you can greatly speed up the process of isolating and correcting errors you encounter while using the HLRN.

The form serves as a guideline, so please make use of it.

(wwb/swy)


Maintenance finished in Hannover
[1526] Oct 07, 2009

HLRN Hannover is available again.
Thank you for your patience.

Gabriel Gaus




Maintenance at HLRN Complex in Hannover
[1525] Oct 07, 2009

A planned infrastructure maintenance at the HLRN Complex in Hannover
will take place from Monday, Oct. 12, 6:00 am to Tuesday, Oct. 13.
During this time access to this complex will not be possible.
Submission of batch jobs will be possible until the downtime begins.
Jobs will remain queued and will start after the downtime is over.
We will inform you in time when service is resumed.

Please plan accordingly.

(BB)


Software Maintenance finished in Berlin Wed Oct 07
[1520] Oct 07, 2009

HLRN site Berlin has been partly available to users again since Wed Oct 07, 13:45.

Interactive work is possible on Berlin login nodes.
Processing of batch jobs has been resumed on the Berlin ICE1 and ICE2 clusters.

The Berlin data servers bd1 .. bd4 remain closed. Processing of batch jobs has not been resumed on the Berlin XE cluster. These systems will be opened soon. Please watch the HLRN status page for status changes.

The opening of HLRN site Hannover will be announced in a separate message.

We thank you for your patience.

/Bka




Software Maintenance starting Thu Oct. 01, 2009 on both HLRN sites
[1518] Oct 01, 2009

An urgent software maintenance has been scheduled on both HLRN sites starting

Thu Oct. 01, 2009, 17:00

Login will be disabled at this time. Interactive sessions will be terminated. Processing of queued batch jobs will continue until Mon Oct. 05.

Regular production work will not be possible again before the middle of next week.

We apologize for the trouble.

/Bka




HLRN data handling and data security
[1516] Sep 25, 2009

Dear users,

I just learned that not everybody is aware of the data handling policies at the HLRN and decided to annoy you with one more mail. In particular, I would like to point out that a backup system is provided to save important data.

We just survived a problem with one of the Lustre file systems, where it was unclear whether everything could be recovered or whether we would have to live with some loss of data. The problem is that this might happen again, even though you can be sure that we have invested quite some effort in data safety, such as redundant servers and RAID-6. Still, failures apparently cannot be completely avoided.

Due to the number of components involved, the two Lustre file systems gfs1 and gfs2 appear somewhat more vulnerable. Lustre is a file system designed for scalability. That means data is distributed over several dozen servers and disk targets to make access fast even if thousands of clients are involved (which is the case at HLRN). If one of the targets fails, that part of data becomes inaccessible. It is therefore a file system suitable for work files and computing results which must be stored or loaded quickly, but not a good choice for long-term storage of final results. Because of the size and the nature of these quickly-changing file systems, it is impossible to provide an automatic backup like the one we have on $HOME.

NOTE: gfs1 and gfs2 are provided without automatic backup.

To secure your files, please see the documentation at https://www.hlrn.de/home/view/System/DataHandling, where you can find how to save the results of successful jobs. It is very easy to use - simply log on to one of the four data servers per site and do a cp to /qfs1/perm/youraccount or just $PERM.
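
A minimal sketch of such a copy (the directory name my_results is a placeholder; hd1 stands for any of the data servers at your site):

# On a data server (hd1..hd4 in Hannover, bd1..bd4 in Berlin):
# copy results from the Lustre work file system to the tape-backed archive.
cp -r $WORK/my_results $PERM/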


Have a nice weekend,

Gabriel Gaus



HLRN Hannover up again
[1514] Sep 25, 2009

HLRN Hannover has been back in production since about 08:30 this morning.

Login gateway hicegate2 is not yet up; please use hg1 or hd1-4 in the meantime.

Gabriel Gaus



HLRN downtime Hannover - good news
[1505] Sep 24, 2009

SGI specialists have successfully repaired the defective OST. Lustre (/gfs1) is fully available again. According to SGI, no data was lost.

The previously announced schedule is therefore obsolete, and you no longer need to copy data to gfs2.

We still have some work to do and cannot immediately resume production, but things look far better than at any point in the last two days.

I will reboot hicegate and hicedata 1 and 2 at 20:00 (within the hour) to free them from various hanging processes. After that, I will try to resume production. Jobs that were put into UserHold will be freed again.


Gabriel Gaus


The original message was:
--------------------------
The Lustre file system at HLRN Hannover has inconsistencies in /gfs1. In particular, one part of the file system (one "OST") is corrupted. This means that most of the files are accessible, but those residing on the affected target are probably not.

We are trying to repair the file system, but do not know how far we will succeed or whether additional problems will arise.

[...]




Production stop in Hannover
[1501] Sep 23, 2009

Lustre currently is not accessible in Hannover and production had to be stopped. A message will be posted when the problem is resolved.

Gabriel Gaus



Overdrawing of NPL quotas extended
[1498] Sep 14, 2009

The time period allowing overdrawing of approved NPL quotas for project accounts has been extended to the end of quarter 4/2009.
Overdrawing of NPL quotas for personal accounts is allowed until the end of quarter 3/2009.

Allowed overdrawing means: batch jobs will be accounted according to the current accounting rules, and job costs will be withdrawn from the appropriate account. The accounting data can be viewed in the project and allocation management portal. Even when an account runs out of NPL, submission of further batch jobs is still allowed.

HLRN users are encouraged to use this time period to gain practical experience with job performance and job costs on the new ICE2 systems in order to write sound project applications.

HLRN expects realistic resource and NPL estimates in project applications submitted for the next deadline (Oct. 28, 2009), and well-founded informal requests for adaptation of approved quotas until the end of 2009.

See Announcement 1420 for the original announcement regarding this issue.

/Bka



Limits for batch job class testq increased
[1495] Sep 14, 2009

The following limits for batch jobs in class testq have been changed:

- max node count increased from 8 to 16
- max wallclock time increased from 1 hour to 3 hours

Defaults are left unchanged:

- default node count is 1
- default wallclock time is 30 minutes
- default feature is xe

Note that testq jobs requesting more than 8 nodes have to request one of the features ice, ice1, ice2, since such jobs are not possible on XE nodes.
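
For example, a test job using the new limits might start with a header like this (a sketch; the values are placeholders within the limits listed above):

#PBS -q testq
#PBS -l nodes=16:ppn=8
#PBS -l feature=ice2
#PBS -l walltime=03:00:00
# 16 nodes exceed what testq can place on XE nodes, so one of the
# features ice, ice1 or ice2 must be requested explicitly.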

See https://www.hlrn.de/home/view/System/BatchSystem for configuration details concerning the batch system.

/Bka



OpenFOAM Version 1.6 is installed
[1494] Sep 14, 2009

OpenFOAM - the Open Source CFD Toolbox - version 1.6 is available on the HLRN system. See our OpenFOAM web page for details on how to access and use the package and the online documentation.

(tc/wwb)


PERM Maintenance on BICE
[1492] Sep 10, 2009
Hardware maintenance will be performed on the tape archive in Berlin on Friday, September 11 from 8:00 until 10:00. During that time there will be no access to files on PERM that have been migrated to tape.

Please plan accordingly.

(wwb)


Berlin open for all users (bigq)
[1490] Sep 07, 2009

Berlin has been open for general production since 14:09.


Gabriel Gaus


The original message was:
------------------------------------------------------
General Availability of the ICE2 System
HLRN is proud to announce the general availability of the powerful new ICE2 system for all users in two steps. [...]

For proper use of the different HLRN clusters it is absolutely vital to [...]



Maintenance finished in Berlin
[1486] Sep 03, 2009

Complex Berlin is back to production after resolving the network problems.

ICE1 and XE queues (i.e. testq, smallq and mediumq) are open again for all users. ICE2 (as announced) remains reserved for internal use until next Monday.


Thank you for your patience,

Gabriel Gaus



Maintenance finished in Hannover, Berlin delayed
[1483] Sep 03, 2009

Complex Hannover is now open for general use.

Please read the documentation concerning the new batch system configuration to learn about the changes implemented. bigq now definitely goes to ice2. To get ice1 blades, you have to use mediumq. testq still goes to XE nodes if you do not specify that you want the feature ice1 or ice2 - we will watch this and decide what defaults to use in a few days.


Opening of Complex Berlin is delayed due to a network issue.

The network in Berlin is intermittently slow, resulting in difficulties with the global Lustre file system. Unfortunately we cannot yet resume production there. We are investigating the issue and hope to be back as soon as possible.


Gabriel Gaus




General Availability of the ICE2 System
[1481] Aug 31, 2009
HLRN is proud to announce the general availability of the powerful new ICE2 system for all users in two steps.
  • The Hannover complex HICE including the ICE2 system will be opened on Wednesday, September 2 at 18:00 after the maintenance.
  • The ICE2 system at the Berlin complex BICE will be opened on Monday, September 7 at 14:00
    (Note: the ICE1 and XE systems will be available on Wednesday, September 2 at 18:00 after the maintenance).
Together with the ICE2 the second global filesystem /gfs2 will be available at each site on all nodes via the new environment variable $WORK2.
For proper use of the different HLRN clusters it is absolutely vital to [...]
Note: until the end of September 2009 system usage is accounted, but overdrawing of your NPL quotas is tolerated, see Announcement 1420 (Update: tolerated overdrawing has been extended until the end of December 2009).

(wwb)



HLRN Downtime in Hannover Sept 1, 2009
[1475] Aug 27, 2009

Complex Hannover is reserved for approximately 24 hours starting Sept 1st, 17:00 for some required benchmarks.

This means the following:

  • batch processing is suspended

  • all interactive machines in Hannover will be closed at 16:30. Interactive sessions will be terminated

  • no access to data on the HLRN is possible until the end of the downtime (see also Announcement 1474, Downtime and Maintenance at ZIB)


Please take the appropriate measures.

Thank you for your support,

Gabriel Gaus


Downtime and Maintenance Day in Berlin at ZIB, September 1, 2009
[1474] Aug 26, 2009

The annual infrastructure maintenance will be performed at ZIB on Tuesday, September 1, from 7:00 until approx. 18:00. Due to additional maintenance work and performance measurements
  • starting 21:00 on Monday, August 31, the Berlin complex BICE and some HLRN services will be partially or fully unavailable for ALL users until 18:00 on Wednesday, September 2.
The schedule concerning user service during the downtime is as follows:

On Monday, August 31:
  • starting at 21:00 the archive system $PERM on BICE will be unavailable.
On Tuesday, September 1:
  • at 6:30 end of user service on BICE, interactive sessions and all running batch jobs on BICE are terminated.
  • Between 8:00 and approx. 17:00:
    • ZIB may be partially or entirely cut off from electric power and from the Internet
    • ZIB personnel can be contacted by phone, only
    • Email traffic to/from ZIB will be interrupted, but mail will not get lost
    • The web server of the Scientific Council including the user and project database server (zulassung.hlrn.de) will be unavailable
  • Approx. 18:00: Services are expected to resume, however, BICE remains offline until 18:00 on Wednesday, September 2 (see also Announcement 1475 for information concerning HICE).
All other services, e.g. the HLRN web and mail servers will remain available during the maintenance.

Thank you for your patience and understanding.

(wwb)



TUHH Workshop "Parallelprogrammierung" in Hamburg
[1469] Aug 19, 2009

Note: the course will be held in German.

A workshop "Parallelprogrammierung" (parallel programming) with a focus on MPI and OpenMP will take place at the TU Harburg in Hamburg from Sept. 28 to Oct. 02, 2009.

Programming knowledge in Fortran or C/C++ as well as Unix skills are required. Participants will have the opportunity to develop parallel programs themselves and to test them on the HLRN system.

The workshop is free of charge and, until the participant limit is reached, also open to participants from the HLRN community. Further information can be found in the announcement of the TU Harburg. Please make sure to register in time by email to workshop-hamburg@hlrn.de.
(sn/wwb)


HLRN second installation phase - acceptance period
[1470] Aug 18, 2009

HLRN and SGI have successfully finished the integration of the new compute and storage hardware into the complexes BICE and HICE (currently 320 MPP nodes "ICE1" each). At each site an additional 960 MPP nodes "ICE2" will increase the peak performance of the complex by 90 TFlop/s. The global disk capacity will be doubled. Currently the new hardware is in the acceptance phase, which started on Tuesday, August 18, 12:00.

The ICE1 and XE systems are already available again for general use. The ICE2 systems will be opened for dedicated test users in the near future. General availability of the ICE2 systems in Berlin and Hannover for all users will be announced in a separate message.

For further information on the new system please see our SGI system hardware overview and the latest HLRN Newsletter HLRN Informationen Nr. 7 (German only).

(wwb)



HLRN data servers available in batch jobs
[1466] Aug 14, 2009

The HLRN data servers

bicedata1 .. bicedata4 and hicedata1 .. hicedata4

have been integrated into the batch system.

Usual batch system commands like showq or qstat may be executed on these nodes.

Regular batch jobs may be submitted from these nodes. Interactive batch jobs submitted on these nodes to HLRN compute nodes will not run because of missing network connections.

Batch jobs may be run on these nodes, too. They are restricted to serial runs (-l nodes=1:ppn=1) for a max. wallclock time of 3 hours.
To request a data server in a job script, specify

#PBS -l nodes=1:ppn=1
#PBS -l feature=data

or alternatively

#PBS -l nodes=1:ppn=1
#PBS -q dataq

See

https://www.hlrn.de/home/view/System/BatchSystem

for further information.

We hope that the availability of the data servers for batch jobs is helpful in doing regular postprocessing and data handling work.
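
As an illustration, a complete serial data-handling job might look like the following sketch (directory and file names are placeholders):

#!/bin/bash
#PBS -q dataq
#PBS -l nodes=1:ppn=1
#PBS -l walltime=02:00:00
# Serial post-processing on a data server: pack results from the work
# file system and store the archive on $PERM.
cd $WORK/myproject
tar czf $PERM/myproject_results.tar.gz results/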

/Bka


HLRN Hannover downtime finished
[1462] Aug 13, 2009

Maintenance and the first set of benchmark runs are finished in Hannover and the system is available again.

Thank you for your patience.

Gabriel Gaus


HLRN site Berlin downtime finished
[1460] Aug 13, 2009

HLRN site Berlin has been available to users again since Thu Aug 13, 13:15.
Batch system work has been resumed.

Problems that arose during the integration of additional compute nodes and additional Lustre storage have been resolved. Benchmarking on the additional compute nodes (MPP2 system ICE2) is under way. A note will be posted when these systems are available to all users.

/Bka


Extended Downtime of HLRN sites Berlin and Hannover
[1458] Aug 12, 2009

The current downtimes of both HLRN sites have to be extended until

Thu Aug 13, around noon.

Currently all systems are dedicated to performance measurements.

Thank you for your patience.

/Bka


Complex Berlin down
[1456] Aug 11, 2009

Complex Berlin had to be taken offline yesterday around 23:00 due to hanging processes and various problems with the Lustre file system. HLRN and SGI are investigating the problem.

We apologize for the inconvenience.

Gabriel Gaus



Downtime on Berlin complex
[1454] Aug 10, 2009

Dear HLRN users,

SGI has declared today (August 10) that the second stage of HLRN-II is ready for use in Berlin. On August 11 the acceptance period will start, which will take approximately 40 days. The Berlin complex will not be available

from Tuesday, 11 August 2009, 14:00
to Wednesday, 12 August 2009, 22:00.

During that time performance measurements will be performed which require an empty machine. Then the old system will be opened again.

The second stage is the most powerful part of the installation. It will provide over 80% of the overall capacity of HLRN-II.

We will inform you when the new machine will be opened for general use.

(stb)



HLRN Hannover downtime - update
[1449] Aug 10, 2009

We are currently waiting for a replacement line card for a failed InfiniBand switch. We hope the problems resulting from the defective component can be solved this evening.

In addition, a set of benchmark runs was scheduled after the repair. The benchmark results are needed as one of the prerequisites for the acceptance of the new MPP part.

Production will then be resumed on Thursday.


We apologize for the interruption.

Gabriel Gaus



The original message was:
----------------------------------------------------
HLRN complex Hannover just crashed (at about 14:50) with various symptoms such as missing file systems and crashing blades.

We are investigating the problem.

The batch system in Hannover will be drained and login has to be closed for the moment.

Sorry for the inconvenience - more news to follow.

Gabriel Gaus


HLRN Berlin downtime finished
[1446] Aug 07, 2009

The downtime of HLRN site Berlin scheduled from Mon Aug 03 until Fri Aug 07 finished on Sat Aug 08, 22:30.

Login is enabled, batch system work has been resumed.

We apologize for the delay.

/Bka


Hannover downtime finished
[1444] Aug 07, 2009

HLRN complex Hannover has been open for production work again since Fri Aug 07, 21:00.

System verification has shown that some network problems remain. It is currently not possible to use the Intel compilers inside batch jobs running on ICE machines. ssh connections from batch jobs running on ICE machines to the world outside HLRN are not possible either. Batch jobs that rely on these features will temporarily fail.

We apologize for the inconvenience this may cause.
Thank you for your patience.

/Bka



HLRN Hannover Downtime
[1442] Aug 06, 2009

HLRN complex Hannover will be closed on Friday, August 07 for network maintenance from 11:50 to about 17:00. All login sessions will be terminated.

A message will be posted when the maintenance is finished.

Harald Schwier



HLRN second installation phase - Downtime Berlin complex
[1441] Aug 05, 2009

The HLRN installation continues with the integration of the new compute and storage hardware into the Berlin complex BICE (currently 320 MPP nodes, "ICE1"). An additional 960 MPP nodes ("ICE2") will increase the peak performance of the complex by 90 TFlop/s. The global disk capacity will be doubled.

For the integration, a downtime of BICE is scheduled:
  • Monday, August 03, 2009 access will be closed at 8:00.
  • Friday, August 07, 2009 access will be opened again in the evening.
After this downtime the ICE1 and XE systems in Berlin will be available again for general use.

The ICE2 will at first be opened for dedicated test users only. General availability of the ICE2 systems in Berlin and Hannover for all users will be announced in a separate message.

For further information on the new system please see the latest HLRN Newsletter HLRN Informationen Nr. 7 (German only).

(wwb)



HLRN second installation phase - some jobs lost in Hannover
[1437] Jul 30, 2009

Dear users of the HLRN system,

as you know, integration of the hardware for phase 2 started today in Berlin.

While the Berlin batch servers were taken down, some jobs in Hannover were cancelled at about 10:45 due to a faulty reaction of the batch system.

Please resubmit jobs that were lost by this event.

You do not need to apply for a refund of lost NPL, since accounting is not enforced for this quarter and a negative NPL balance will not be transferred to the next period (starting October). This was announced in HLRN news 1420 (quoted below).


We are sorry for the inconvenience,

Gabriel Gaus


--------------------------------------------------
Changes of Batch System Configuration
[1420] Jul 22, 2009

[...]
3) Overdrawing of NPL quotas in personal and project accounts will be tolerated until the end of quarter 3/2009. That means users may submit batch jobs although the balance in their accounts is negative. Nevertheless, all batch jobs will be accounted, and accounting data will appear in the project and allocation management system.



$HOME lost at HLRN site Berlin
[1436] Jul 29, 2009

On Wed July 29, the $HOME file system at HLRN site Berlin was inaccessible from about 18:45 to about 19:00. All batch jobs that were running at that time were lost.

We apologize for the trouble this causes; please resubmit the affected work. Batch system work was resumed at about 19:15.

/Bka


HLRN VI-HPS Tuning Workshop 2009
[1433] Jul 29, 2009

The North-German Supercomputing Alliance (HLRN) and the Virtual Institute - High Productivity Supercomputing (VI-HPS) are organizing a workshop on the tuning of parallel programs using tools developed by the VI-HPS.

The workshop is aimed at the HLRN consultants and at experienced HLRN users who develop, optimize, and port larger programs themselves.

Time: Wednesday, September 9, 2009, 9:00 until Friday, September 11, 2009, 15:00.

Venue: Bremen, ZARM/University of Bremen

For details of the workshop please see the workshop web page.

(ln/wwb)


HLRN Hannover up again - addendum
[1431] Jul 24, 2009

Switch problems in the late evening led to aborted sessions, and I had to close the batch queues temporarily.

The issue should be resolved now.


Thank you for your patience,

Gabriel Gaus



HLRN Hannover up again
[1427] Jul 24, 2009

Dear Users,

The HLRN is available again in Hannover. The integration of installation phase 2 was successful.

A few issues still remain, but batch processing has resumed and almost everything is back to normal. As always, if you notice peculiarities, do not hesitate to contact HLRN support.

hg2 is not yet available due to a problem with the network interface. Please use hg1, hd1, hd2 or hd3 as login servers. A message will be posted when the second gateway is available again.


Thank you for your patience,

Gabriel Gaus



HLRN second installation phase - Downtime Hannover complex extended
[1424] Jul 23, 2009

The downtime of HLRN complex Hannover has to be extended due to network problems. Service will not be resumed before Fri July 24, 09:00.

We apologize for the inconvenience.

/Bka


Changes of Batch System Configuration
[1420] Jul 22, 2009

Due to the integration of an additional 960 compute nodes per HLRN site, several changes to the batch system configuration will take place on Thu July 23 on both HLRN sites:

1) The new nodes (ICE2 nodes) will run jobs that request them with -l feature=ice2. This request is similar to requesting the queue bigq (an example job header is shown further below). This queue will accept jobs that request node counts in the range of 16 to 256 nodes for a maximum of 12 hours.
At the beginning, this queue will be open for selected users and available at HLRN site Hannover only. A note will be posted when the ICE2 cluster in Hannover is opened to the public.

2) The current ICE1 nodes will run jobs that request them with -l feature=ice1, which is similar to requesting the queue mediumq. This queue accepts jobs that request node counts in the range of 4 to 64 nodes for a maximum of 12 hours.
All jobs that were submitted to the ICE1 clusters before July 23 and that were queued in bigq will be rerouted to the queue mediumq.
Note that jobs that request both -l feature=ice1 and -q bigq will no longer run.

3) Overdrawing of NPL quotas in personal and project accounts will be tolerated until the end of quarter 3/2009. That means users may submit batch jobs although the balance in their accounts is negative. Nevertheless, all batch jobs will be accounted, and accounting data will appear in the project and allocation management system.

The documentation in

https://www.hlrn.de/home/view/System/BatchSystem

reflects the changes given above, although the current state of implementation is preliminary until the ICE2 clusters at both sites are opened for public use.
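
For illustration, the corresponding job script lines for the two node types might look like this (a sketch; node counts are placeholders within the ranges above):

# Job for the new ICE2 nodes (handled like bigq):
#PBS -l nodes=32:ppn=8
#PBS -l feature=ice2
#PBS -l walltime=12:00:00

# Job for the current ICE1 nodes (handled like mediumq):
#PBS -l nodes=16:ppn=8
#PBS -l feature=ice1
#PBS -l walltime=12:00:00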

/Bka


Downtime of Hannover complex and backup
[1419] Jul 17, 2009

Dear HLRN Users,

With respect to the installation of HLRN-II stage 2 in Hannover next Monday, we would like to remind you that there is no automatic backup for the work file system /gfs1/work/<user-account> (see the data handling policy at HLRN:
https://www.hlrn.de/home/view/System/DataHandling).

As usual, please take precautions to back up any vital data on /gfs1 yourself. We will be glad to provide assistance if needed.

(BB)



HLRN second installation phase - Downtime Hannover complex
[1416] Jul 15, 2009

Dear HLRN Users!

The second main phase of the HLRN installation is picking up pace with the integration of the new compute and storage hardware into the production HLRN system. In this phase an additional 960 MPP nodes ("ICE2") will be integrated into the HLRN (currently 320 MPP nodes, "ICE1"). For the integration, a downtime and maintenance period for the Hannover complex will be necessary:
  • July 20, 2009 access will be closed at 8:00.
  • July 23, 2009 access will be opened again at 20:00.

After this downtime and maintenance period the ICE1 at Hannover will be available again for production use. The ICE2 at Hannover should then be integrated into the system but will only be available for dedicated test users for some time. Information on the general availability of the ICE2 at Hannover will follow in a separate news message.

Regarding the tests on the ICE2 at Hannover: it is possible that the processing of batch jobs will be interrupted on short notice, for example due to system configuration changes or further performance measurements.

The corresponding integration at the Berlin complex will presumably start in the time interval from August 3, 2009 to August 6, 2009. We hope that the "marriage" of the two phases will go on as scheduled and apologize for any inconvenience with this downtime.

For further information (German only) on the new system please see the latest HLRN Newsletter:
https://www.hlrn.de/twiki/pub/NewsCenter/NewsLetter/hlrn_informationen_7.pdf


Yours sincerely,

your HLRN-Team

(CPR)



hicegate1 available again
[1413] Jul 14, 2009

hicegate1 is available again for your work.

Thank you for your patience.

Gabriel Gaus


The original message was:
--------------------------
hicegate1 is currently in an unstable condition and will probably crash soon. To give us an opportunity to analyze the problem, login to hicegate1 will be closed today at 15:00 (i.e. in approx. 80 minutes).

The machine will be rebooted after analysis, so any sessions still open will definitely terminate and unsaved data will be lost.

Please use hg2, hd1, hd2 or hd3 for your work until hg1 is open again.



BICE is down for emergency file system check
[1410] Jul 14, 2009

Update Jul 14, 12:35: The file system /gfs1 is available again. Batch production on BICE has resumed. Please check your jobs and data on BICE and resubmit any faulty jobs if necessary.

The original announcement was:

The Berlin complex BICE has been taken offline to analyze possible problems with the global Lustre file system /gfs1.

/gfs1 is unavailable on all nodes (access to files may be hanging), and the batch system has been stopped. Login has not yet been closed, as the home file system is still available.

A note will be posted when the file system and batch production are available again.

We apologize for the repeated down times.

(wwb)


Maintenance of tape library in Hannover
[1406] Jul 13, 2009

On Wednesday, July 15th, maintenance is scheduled for the tape library in Hannover.

Access to files on tape will be interrupted from 08:00 to 17:00. Data on /qfs1 will probably remain accessible as long as it is still in the disk cache and has not yet been migrated to tape.

Gabriel Gaus



HLRN Hannover available again (was: Queues closed in Hannover)
[1401] Jul 04, 2009

HLRN Hannover is available again after a successful check of the Lustre file system.

The original message was:
--------------------------
hg2 crashed yesterday due to Lustre problems which require a full filesystem check on /gfs1. Therefore, I had to close the queues in Hannover. Currently running jobs may finish.

I will close login to Hannover today at 17:30.

If you urgently need something, please remember to use the data servers for copying (hd1, hd2, hd3) and NOT the remaining gateway (hg1).

A message will be posted when the system is accessible again.


Sorry for the trouble,

Gabriel Gaus




Personal Allocations increased to 1000 NPL per Quarter
[1391] Jul 01, 2009

In order to adapt current policies at HLRN to the upcoming delivery of an additional 960 compute nodes per HLRN site, the personal allocation for each HLRN user has been increased from 100 NPL per quarter to 1000 NPL per quarter.

Personal allocations are provided via personal projects, which have the same name as the user's login name and which exist independently of other projects. Personal allocations can be used to prepare project proposals, to finish work for expired projects, to continue working when project allocations are used up, or to do work without having any other project.

Users who are members of approved projects usually do not touch their personal allocations but use the project allocations instead (otherwise applying for a project would be pointless). Every user is encouraged to visit the project administration portal to check whether the default project setting fits his/her intentions. Check the help pages of this portal for assistance.

For further information, see
https://www.hlrn.de/home/view/Service/Accounting

Contact your HLRN consultant or email to support@hlrn.de, if you have questions.

/Bka



Project applications - next deadline July 28, 2009
[1380] Jun 30, 2009
Resources of the HLRN system are allocated according to approved project quotas (see the "Application HowTo" and the Scientific Board portal page).

The next deadline for project applications is July 28, 2009.

Please contact your local HLRN support person or your HLRN project consultant before submitting the application.

(wwb)



Queues closed in Hannover for emergency maintenance (was: hg2 crashed in Hannover)
[1393] Jun 30, 2009

hg2 crashed yesterday due to Lustre problems which require a full filesystem check on /gfs1. Therefore, I had to close the queues in Hannover. Currently running jobs may finish.

I will close login to Hannover today at 17:30.

If you urgently need something, please remember to use the data servers for copying (hd1, hd2, hd3) and NOT the remaining gateway (hg1).

A message will be posted when the system is accessible again.


Sorry for the trouble,

Gabriel Gaus



The original message was:
--------------------------
hg2 has to remain offline for a while. Please use hg1, hd1, hd2 or hd3 for the time being.

hg2 crashed due to reasons which are yet unclear. Ssh login will remain disabled for about one or two hours to give us time to investigate. You may log back in as soon as ssh access is open again.

We are sorry for the inconvenience and the loss of work.

Gabriel Gaus



Batch system maintenance Jun 18 (Hannover), Jun 23 (Berlin)
[1389] Jun 23, 2009

Batch job processing was interrupted

* at HLRN site Hannover on Thu June 18, 14:00 - 18:00
* at HLRN site Berlin on Tue June 23, 14:00 - 18:00

New versions of Torque and Moab were installed, which fix various issues that appeared during the last months. The most important changes from the viewpoint of the user are:

1) The Moab grid joining the two HLRN sites has been formed again. Within the grid, users may submit jobs requesting the remote site or either site with the -l partition option (see the example after this list). Moab commands like showq combine a local and a remote view by joining information from both sites. Throttling policies such as the maximum number of running jobs per user and queue always have a local scope, regardless of the state of the grid.

2) The amount of memory that is available to batch jobs on a node has been reduced from the maximum currently available to 14.5 GByte per node (ICE1 nodes), or 55 GByte per node (XE nodes). This affects only jobs that request memory explicitly with the -l mem option, which is not mandatory. The change is used as a scheduling criterion and primarily avoids overly optimistic packing of jobs onto XE nodes, which have enough free memory at job start time but experience increased memory demands from system processes later. The true memory consumption of running jobs and the overcommitting of real memory by batch jobs are still not controlled. However, it is recommended that batch jobs do not allocate more than the memory given above per node, to retain job and system stability.

3) Interactive batch jobs submitted with msub -I or qsub -I now get a default wallclock time limit when they do not specify -l walltime. Previously, such jobs were killed right after starting.

4) Moab and Torque now have a consistent view of job attributes like Account, Queue, and Walltime. This was not the case in the past and led to contradictory actions by Moab and Torque components on some jobs.
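
For illustration, the partition and interactive-job options from points 1) and 3) might be used as follows (a sketch; <partition>, node counts and walltimes are placeholders):

# Interactive test job with an explicit wallclock limit (a default limit
# is now set if -l walltime is omitted):
qsub -I -l nodes=1:ppn=8 -l walltime=01:00:00

# Submission to a specific site within the Moab grid via the partition
# option (replace <partition> with the name of the desired partition):
msub -l partition=<partition> -l nodes=4:ppn=8 -l walltime=02:00:00 jobscript.sh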

See https://www.hlrn.de/home/view/System/BatchSystem for more information about batch system configuration and operation at HLRN.

/Bka



Newsletter: Seventh edition of HLRN Informationen published
[1388] Jun 23, 2009
The seventh edition of the HLRN newsletter HLRN Informationen is available for download at
https://www.hlrn.de/home/view/NewsCenter/NewsLetter.
This web page also contains instructions for (un-)subscribing to the newsletter mailing list.

From the content:
  • News about the HLRN-II system including Top500 position in the new June 2009 list
  • Project report on "Strukturbildung von amphiphilen Molekülen auf
    Oberflächen mit chemischen Mustern"
  • MPI/OpenMP and Tools Workshops in September 2009
(wwb)



Maintenance of $PERM in Berlin
[1386] Jun 18, 2009

On Monday, 22 June, maintenance will be performed on the tape drives in Berlin. From 10:00 to 12:00 tape processing will be interrupted.

We apologise for the inconvenience.

(stb)


NAMD 2.7b1 available at HLRN
[1381] Jun 16, 2009

The molecular dynamics program NAMD version 2.7b1 is available at HLRN.
The previous version 2.6 is still the system default until further notice.

Load the module file namd/2.7b1 to use the new version.

See https://www.hlrn.de/home/view/System/NAMD for a description of the package.

(Bka)



NFS server crashed in Hannover
[1376] Jun 08, 2009

On Mon Jun 08, 2009 at about 09:30 a central NFS server hung, causing /sw to go offline.

I did a failover to the standby server which seems to have solved the problem. Please report job aborts resulting from this crash to support.

Gabriel Gaus



$HOME server crashed in Berlin 05.06.2009
[1375] Jun 05, 2009

On Fri Jun 05, 2009 at about 14:30 the server serving users' $HOME directories crashed. As a result, all batch jobs running at that time died.

HOME has been available in Berlin again since about 15:00. Batch processing has been resumed. HLRN and SGI will analyse the reasons for the crash.

HLRN will refund charges of batch jobs that were aborted due to this event. Refunds will appear in the database on Mon Jun 08.

We are sorry about this event.

/Bka


ZIB: Job opening in the field of HPC
[1370] Jun 03, 2009
At the Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB), Computer Science division, High Performance Computing (HPC) department, a permanent position as a research staff member (Wissenschaftliche/r Angestellte/r) in the field of HPC is to be filled immediately.

For further details please see the ZIB job posting (link to the announcement).

(wwb)


Interactive CPU time limit
[1369] Jun 02, 2009

Dear HLRN Users,

effective immediately, a CPU time limit of 3000 s per process is imposed for all login sessions. This became necessary after long-running interactive processes repeatedly caused problems and login nodes had to be rebooted.
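
If the limit is enforced via the standard shell resource limits - an assumption made here only for illustration - you can inspect the value in your login shell with:

    ulimit -t    # prints the per-process CPU time limit in seconds, e.g. 3000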

Please try to run as many long-running processes as possible in batch mode. Recall that single-CPU jobs can be run in batch since May; see Announcement 1352.

Thank you for your cooperation.

(stb)



Network infrastructure maintenance at RRZN starting Jun 03, 18:00
[1366] May 29, 2009

The RRZN has to upgrade several important network infrastructure components on June 03, starting at 18:00. The work will take until 06:00 the next day.

The central HLRN components are not affected by the maintenance, so batch processing will continue and the HLRN link should remain available. External components, however, will experience service interruptions, resulting e.g. in aborted SSH sessions to the hicegates and the Hannover data servers.

You should be able to continue your work in Hannover by logging in via the Berlin machines and then using ssh to Hannover. Please check that you use the internal network (marked by 10.15x.xx.xx addresses) and NOT the external interfaces (130.75.x.x). The external interfaces will be interrupted without warning. Please save any work done in your interactive sessions by June 03, 17:00.


Gabriel Gaus





hicegate 1 available again - tips for using the gateways
[1363] May 27, 2009

hg1 is available again after crashing with an extremely high load. It is not yet fully clear what caused the crash, but the machine was completely clogged up. This has happened before.

We kindly ask our users not to use the gateways for larger-scale data processing or data transfers, if possible.

I will try to give you some examples based on what I have seen up to now:
  • A single scp of a file of up to a few GBytes from an external machine once in a while should be OK.
  • Some pre- or post-processing of up to 30 minutes on a single CPU might be OK, provided the task does not use more than 512 MB of RAM (both virtual and resident size; have a look at the top command) and the machine is not already occupied by others. Please consider using a batch job if at all possible.
  • If you need interactive processing, try the data servers (currently hd1-3 and bd1-4). The gateways should only be used for light work, even though they are equipped with considerable resources.
  • If you want to transfer data, please use the data servers as well. This is particularly important when you want to copy TB files from Lustre to QFS (/gfs1 to /qfs1), which are both mounted on the data servers. So there is no need e.g. to scp or sftp from hg1 to hd2 - you can do this locally with fewer resources and much greater speed (see the example after this list).
  • Jobs suspected of clogging the gateways may be killed without prior notice. We try to do this only in emergencies, but sometimes we will even have to kill potential candidates to save other users' text editing work.
Rule of thumb: use your common sense - the machines are there for your use, but there are only two gateways per site, and all users share them for their interactive work.
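
For example, to move a large result file from Lustre to SAM-QFS, a local copy on one of the data servers is preferable to a network transfer between machines. The paths below are placeholders for this sketch (only /gfs1/tmp/<username> is mentioned elsewhere in these announcements; the target directory on /qfs1 is assumed):

    # on hd1, hd2 or hd3, where /gfs1 and /qfs1 are both mounted
    cp /gfs1/tmp/<username>/result.dat /qfs1/<username>/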

We would like to apologize, of course, for the inconvenience caused by the crash.

Gabriel Gaus



Ferret v6.2 available
[1357] May 25, 2009

Ferret Version 6.2 is available at HLRN. Version 6.1 is still the default. Please use
module load ferret/dynamic_6.2
to use version 6.2.

For a list of bug-fixes and new features please visit
http://ferret.pmel.noaa.gov/Ferret/documentation/version-6-2-release-notes.

Martin Schmidt



Home file system crashed in Hannover
[1354] May 25, 2009

Due to a crash of the replication software for the home file system, both the home file system and partially /gfs1 were unavailable for about 50 minutes around noon today, resulting in a variety of symptoms in HLRN Hannover.

I initiated a failover, and the system should be healthy again since about 13:50.

If you experienced job crashes, please contact your consultant to get your NPL back.

We apologize for the disruption.


Gabriel Gaus



New Batch job class for serial jobs
[1352] May 15, 2009

Due to ongoing demand, a batch job class for serial jobs is now provided on the HLRN machines.

The class name is serialq; it is provided on one XE node per HLRN complex. The class is designed mainly for post-processing and data handling work. The default and maximum wallclock time limit is 3 hours. The node access policy is always shared, which means that jobs of different users can run on the same node at the same time.

The class is an HLRN production class and is part of the pipe queue setup at HLRN. That means: if a job does not request a class explicitly, requests -l nodes=1:ppn=1, and requests a wallclock time limit of up to 3 hours (or no wallclock time limit), then it is routed into the class serialq automatically.

If such a job should not run in this class, it has to request the production class smallq explicitly.
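
A minimal serial post-processing job matching these routing rules could look as follows. The script body and file names are placeholders for this sketch:

    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=01:00:00
    cd $PBS_O_WORKDIR
    ./postprocess input.dat > postprocess.log

Without an explicit class request, such a job is routed into serialq; adding an explicit class request (e.g. #PBS -q smallq) keeps it out of the serial class.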

If you have question about how to use the batch system at HLRN, please see
https://www.hlrn.de/home/view/System/BatchSystem
or contact your HLRN consultant.

/Bka



BICE Maintenance finished
[1343] May 13, 2009

The check of the filesystem holding $WORK at HLRN complex Berlin has been finished successfully. No user files were found to be lost or damaged. Production has been resumed on Fri May 15, 06:30.

Thank you for your patience.

/Bka


The original announcement was:

Maintenance will be performed on the global Lustre file system /gfs1 on BICE in Berlin from Tuesday, May 12, 08:00 until the afternoon of Wednesday, May 13. The Berlin complex BICE will be unavailable during this time. Interactive access and batch production will be interrupted.

We will post a message when BICE is available again.

(wwb)


Comprehensive documentation of MPI libs available
[1341] May 12, 2009

HLRN provides a variety of different MPI implementations to its users. These are now summarized and assessed in "MPI Implementations at HLRN".

The document also focuses on the usage of the mpiexec command for advanced work in batch jobs (like starting several MPI jobs in parallel within one batch job), which should be of interest to the HLRN user community in general.
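
As a generic sketch of the kind of usage described there - the program names and process counts are placeholders, and the exact mpiexec options to use at HLRN are given in the document itself - two MPI runs can be started side by side within one batch job:

    mpiexec -n 32 ./solver_a &    # start the first MPI run in the background
    mpiexec -n 32 ./solver_b &    # start the second MPI run in the background
    wait                          # block until both runs have finished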

/Bka



HLRN at the ISC09
[1338] Apr 27, 2009

The HLRN will be present at the International Supercomputing Conference this year. In order to expand the ISC as a forum for scientists and engineers who use high performance computers for their research, we pass the following announcement on to HLRN users:

The ISC is Europe's leading international conference in the field of high performance computing. This year it takes place in Hamburg for the first time, from June 23-26. With its four-day programme on the use of high performance computers in science and research as well as in design and engineering, the conference is a unique platform for exchanging experience and building personal networks in the world of supercomputing. The conference programme not only includes keynote talks by high-profile speakers such as Andreas von Bechtolsheim of Arista Networks (and co-founder of Sun Microsystems), but above all covers current and practical applications of supercomputers. This year the focus is on aeronautics, alongside dedicated sessions on cloud computing and climate modeling. Further information on ISC09 is available at www.isc09.org/program.

For the first time there is also a "Student Volunteer Program" this year. During the conference the student volunteers assist the ISC team, but they also have the opportunity to attend the sessions that interest them. Board and lodging are of course provided. More information at www.supercomp.de/isc09/Participate/Overview.

We would like to encourage our users to take this opportunity to meet users of high performance computers from other disciplines as well as people from the HLRN Staff at the HLRN booth.

(BB)



Hannover complex back to production status
[1331] Apr 23, 2009

Login and batch job submission are now possible again.

Due to repeated failures of several different parts of the file system underneath /gfs1, some files in $WORK and $TEMP were lost. We did, however, get a list of the files that could not be reconstructed. No data was lost in $HOME.

Users will find a list named badfiles_<username>.txt in their respective home directories containing possibly lost files. If you do not have this file, you were fortunate.

It may be that files mentioned in your personal list had already been deleted intentionally by you before the crash. Please check this.
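
For example, the list for your account can be inspected directly after login (the file name is as described above; that your login name equals $USER is an assumption of this sketch):

    cat $HOME/badfiles_$USER.txt    # possibly lost files in $WORK and $TEMP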

You are reminded that, due to the long interruption, all jobs submitted before the downtime have been put into the UserHold state and have to be either released or cancelled. Please check that your jobs do not need files from the badfiles list of missing data. Have a look in your /gfs1/tmp/<username> directory and clean up temporary files that should no longer be there.

Of course, we have tried to ensure that everything runs smoothly again. Should you experience anything unusual, please contact support@hlrn.de.

Thank you for bearing with us,

Gabriel Gaus



The original message was:
---------------------------------------------------
An invalid block map has been found on one of the Lustre object storage targets. The file system part residing on this OST has been remounted read-only.

This means that:
  • files residing on a small part of /gfs1 are currently accessible read-only. Attempts to append to such a file will fail.
  • the file system must be repaired as soon as possible to avoid further damage.
I have initiated a drain of complex Hannover. No further jobs will start at this moment, and I will close login to Hannover tomorrow morning at 4:00 am.

A message will be posted when service is resumed. We apologize for the disruption.

Gabriel Gaus



Information on the availability of the Hannover complex / Lustre Problem:
[1324] Apr 23, 2009

The results of the latest checks of the file systems (Lustre/extFS) and the work done during the last night lead SGI to believe that the Lustre file system can be repaired without a complete re-creation, although some work remains today to prepare the reintegration of the file system.
  • The owners of files/data/relics found to be possibly damaged will be informed about the objects involved.
  • Because there has been a longer production hold for the Hannover complex, many user jobs have already been run on the Berlin complex in the meantime, while the queue in Hannover is still filled with jobs. In order not to automatically start jobs that have already completed, we will put the current Hannover jobs on user hold. Please take a look at your jobs and decide which to start and which to remove.
  • All batch jobs that are currently queued on the Hannover complex are put into the UserHold state. Owners of these jobs are advised to revisit them in order to avoid duplicating work that has already been done in Berlin, or that became useless for other reasons (see also the showq example after this list). When access to Hannover is possible for users again, the owners of jobs queued in Hannover have the following choices:
    • Release jobs so that they will run: execute mjobctl -u User jobId
    • Remove jobs that should not run: execute canceljob jobId
    • Do nothing. Jobs that remain in the hold state for more than 14 days from now will be removed from the queue automatically.
  • We will immediately inform you as soon as the system will be accessible again for production.
  • We apologize for the inconvenience.
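
To get an overview of your currently held jobs, Moab's showq command can be used. <username> is a placeholder, and the flags below are generic Moab options that may differ slightly in the HLRN setup:

    showq -b -u <username>    # -b lists blocked/held jobs, -u restricts the output to one user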

(CPR)



Hannover complex not available
[1323] Apr 20, 2009

Due to problems with the Lustre file system the Hannover complex remains unavailable.

SGI and several experts are still working on the problem.

File system checks have not been finished yet.

There will be no production for an indefinite time.

We will immediately inform you if the system is accessible again.

We apologize for the inconvenience.

(BB)





Hannover complex not available
[1320] Apr 16, 2009

Due to problems with the Lustre file system the Hannover complex is still not available.

SGI is working on the problem. (At the moment file system consistency checks are running.)

There will be no production before next Monday.

The next status update will be given by next Monday.

We apologize for any inconvenience.

Gerd Brand





Update: Hannover not available (was: Emergency drain of complex Hannover)
[1316] Apr 08, 2009

After several repair actions, there are still problems with the Lustre file system in Hannover. Production must remain on hold for the time being.


We apologize for the inconvenience and hope to be up again soon.

Gabriel Gaus



The original message was:
---------------------------------------------------
An invalid block map has been found on one of the Lustre object storage targets. The file system part residing on this OST has been remounted read-only.

This means that:
  • files residing on a small part of /gfs1 are currently accessible read-only. Attempts to append to such a file will fail.
  • the file system must be repaired as soon as possible to avoid further damage.
I have initiated a drain of complex Hannover. No further jobs will start at this moment, and I will close login to Hannover tomorrow morning at 4:00 am.

A message will be posted when service is resumed. We apologize for the disruption.

Gabriel Gaus



GROMACS available at HLRN
[1311] Apr 02, 2009

The molecular dynamics simulation package GROMACS version 4.0.4 is now available at HLRN.

See https://www.hlrn.de/home/view/System/GROMACS for a description of the package.

(BB)



HLRN Maintenance Wed April 1, 2009 finished
[1308] Apr 01, 2009

HLRN site Berlin is accessible for users again since Apr 1, 2009, 17:00.
HLRN site Hannover is accessible for users again since Apr 1, 2009, 18:00.

A new Moab version has been installed, which behaves slightly differently than before. The most important changes (and known bugs) for users are:

1) Interactive jobs get a default wallclock time limit of 0 seconds and are cancelled immediately after job start. Therefore, always specify a wallclock time limit on the command line, and take care not to exceed the queues' maximum wallclock time limits.

2) Jobs that are submitted directly into the UserHold state (#PBS -h) cannot be freed with mjobctl -u. They have to be freed with qrls, specifying the Torque job ID (see the qstat output, or the showq -v output).
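
For example (the job ID below is a placeholder; qstat and qrls are the Torque commands mentioned above):

    qstat                    # look up the Torque job ID of the held job
    qrls <torque job id>     # release the user hold on that job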

If you notice additional unusual behaviour of the batch system, please drop a note to HLRN support. HLRN is in close contact with the vendor to eliminate these problems as soon as possible.

Thank you for your patience.

/Bka



HLRN Maintenance Wed April 1, 2009
[1305] Mar 25, 2009

HLRN sites Berlin and Hannover will undergo system maintenance on Wed April 1, 2009, starting at 09:00 and ending in the afternoon.

During this time, both HLRN sites will be inaccessible. Batch system work is stopped. Login into HLRN machines is disabled. Open interactive sessions will be terminated.

During the maintenance, the software inventories of HLRN machines will be synchronized, firmware of the storage systems will be upgraded, and a major bug fix release of the batch system software will be installed.

A note will be posted when HLRN is accessible for users again.

Thank you for your patience.

/Bka



HOME File System - Software Maintenance on Tuesday, March 10
[1302] Mar 09, 2009
Software maintenance and tests will be performed on the replication software for the HOME file system on Tuesday, March 10, from 16:00 until 20:00. During the maintenance, recent issues with the replication will be further analyzed and fixed. This may at times result in degradation or partial loss of replication functionality. Working with HOME should not be affected as long as you stay on one complex.

Advice for users:
  • Use HOME on one of the complexes only. Do not use both complexes at the same time.
  • If you experience problems with your data on HOME (accessibility, possible corruption, etc.) please notify HLRN Support immediately.
Thank you for your cooperation in this matter.

(wwb)



Many-Core and Reconfigurable Supercomputing Conference
[1301] Mar 05, 2009
The Many-Core and Reconfigurable Supercomputing Conference (MRSC'09) will take place at Zuse Institute Berlin from March 25-26, 2009. For details please see the conference website at http://www.mrsc2009.org/. Registration deadline is Tuesday, March 17, 2009, 09:00 UTC.

(wwb/stk)


Webserver Certificate Change on Thursday, March 05 2009
[1299] Mar 04, 2009

The current security certificate for the HLRN webserver expires and will be replaced by a new one on Thursday, March 05, 2009.

In order to avoid error messages when visiting our web pages with Mozilla Firefox, a root certificate of Deutsche Telekom has to be installed in the web browser. Please visit
https://signet.telesec.de/caservice/Signet/T_Root_akz/index.html
and download the certificate. We recommend accepting all purposes offered for this certificate.

Background information on the warnings with Mozilla Firefox can be found on the web pages of DFN
http://www.pki.dfn.de/index.php?id=statusmozilla

(wwb)


GAMESS_US version January 12, 2009 R1 available at HLRN
[1297] Mar 02, 2009

The quantum chemistry package GAMESS_US version January 12, 2009 R1 for 64 bit IA64/x86_64 is now available at HLRN.

See https://www.hlrn.de/home/view/System/GAMESS_US for a description of the package.

(BB)



Berlin XE not accessible Thu Feb 19
[1293] Feb 19, 2009

On Thu Feb 19 2009 the XE cluster at HLRN site Berlin will be down
from about 09:00 until late afternoon.

No batch jobs are scheduled to these nodes during this time.

Update:

The Berlin XE cluster has been available again since Thu Feb 19, 13:15.

/Bka


HLRN Maintenance finished
[1291] Feb 12, 2009

Both HLRN sites have been accessible to users again since Fri Feb 13, 2009, 19:30. Batch system production has been resumed.

Thank you for your patience.

/bka


HLRN Maintenance from Monday February 09 until Thursday February 12, 2009
[1285] Feb 05, 2009
In preparation for the installation of the HLRN "Phase 2" systems later this year both HLRN complexes will be undergoing maintenance from Monday February 09, 8:00 until Thursday February 12, 2009, 20:00. During the maintenance period the system software and several other components will be upgraded, most notably the Lustre file system. The availability of the complexes is as follows.
  • The Hannover complex HICE will be fully unavailable
    from Monday February 09, 8:00 until February 11
    from Thursday February 12, 16:00 until Thursday February 12, 20:00

  • The Berlin complex BICE will be fully unavailable
    from Tuesday February 10, 15:00 until Thursday February 12, 20:00.
Please plan accordingly.

(wwb)


No login to hd4 because of system upgrade
[1283] Feb 03, 2009

I have taken hicedata4 out of production for a system upgrade. Please use hd1-3 instead.

Gabriel Gaus



Solved: Hardware problems with tape robot system in Hannover
[1281] Feb 02, 2009

The tape robot has been repaired and everything should work again.

Gabriel Gaus


The original message was:
-------------------------
The tape robot in Hannover has a defect.

Backup/Restore of HLRN data is not possible at this time.
Access to SAM-QFS (/qfs1) will fail if data has already expired from disk cache.

The backup group is working on the problem.
A message will be posted when everything is working again.




iscsi swap space deactivated on ICE Hannover
[1279] Jan 22, 2009

On the ICE component of the HLRN-II system, a minimal remote swap file of 50 MB is configured by default on each blade.

Over the last few months, we experienced intermittent availability of this file. Since we suspect that this causes various errors, we deactivated it today (Jan 22) at 15:40 to test whether this improves system stability.

Users should experience no impact on their jobs - the file only added 50 MBytes of virtual "RAM", which should not be used for jobs anyway.

Since this is an experiment, please do not hesitate to notify me if you experience any problems that you believe could be related to this.


Gabriel Gaus
gaus_at_rrzn.uni-hannover.de


Solved: Network problems between Complexes
[1275] Jan 22, 2009

A defective patch cable had to be replaced.

Thank you for your patience,

Gabriel Gaus


The original message was:
------------------------------
We are currently experiencing severe network problems between the complexes. Internal communication between Berlin and Hannover is not possible. To move between complexes, please use the external interfaces instead:

[...]




mpiexec available for SGI MPT
[1273] Jan 19, 2009

HLRN provides a new startup procedure for MPI programs to users of SGI MPT. After loading an mpt module file, the command mpiexec is in the user's $PATH. It is intended to replace the command mpiexec_mpt, which is provided by SGI.

The mpiexec command simplifies the start of MPI executables inside of batch jobs.

It is intended to provide a common and handy interface for starting MPI executables that use MPI libraries such as MVAPICH2, Intel-MPI, or SGI-MPT at HLRN, by hiding most of the gory, library-specific details of the different MPI libraries from the user.
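
A typical invocation inside a batch job might look as follows. The process count and program name are placeholders, and the exact options supported by the HLRN mpiexec are described on the page linked below:

    module load mpt                 # make the SGI MPT library and mpiexec available
    mpiexec -n 32 ./my_mpi_program  # start the MPI executable with 32 processes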

See https://www.hlrn.de/home/view/System/SgiMpt for details.

/Bka



Solved: Home directories not available in Hannover
[1267] Jan 13, 2009

Home directories are available again in Hannover.

We apologize for the trouble.


Gabriel Gaus


The original message was:
-------------------------
The central home directory server hhome1 just
failed in Hannover. The hot standby server could
not be activated due to errors.

We are of course trying to solve the problem as
quickly as possible.

Gabriel Gaus






hicedata1/hd1 available again
[1263] Jan 06, 2009

hicedata 1 is available again and restore from tape should work as well.


Gabriel Gaus


The original message was:

Reboot of hicedata 1 required
---------------------------------------
Due to several file system availability problems over the last two weeks, we have a number of hanging processes on hd1 which seem to have an impact on the restore of HLRN-1 data. They cannot be killed and require a reboot.

I will reboot hd1 tomorrow, Wed Jan 07, at 10:00 am. Please log off by that time; processes still running will be killed without further notice.




Resolved: no login possible to hd1/hicedata1
[1260] Jan 05, 2009

Users experienced login problems on the data server hd1. These were due to Lustre difficulties (/gfs1) on that machine, which in turn caused user profile execution to hang.

I have just remounted the file system, and operation should be back to normal.


Gabriel Gaus


