nnm:cloud_computing

Differences

This shows you the differences between two versions of the page.

nnm:cloud_computing [2024/01/03 16:16]
stefan.birner [Submitting jobs to HTCondor pool with nextnanomat]
— (current)
Line 1: Line 1:
-======nextnano.cloud====== 
- 
- 
- 
- 
-==== Useful HTCondor commands for the Command Prompt ==== 
-  * ''condor_submit <filename>.sub'' Submits a job to the pool.
-  * ''condor_q'' Shows the current state of your own jobs in the queue.
-    * ''condor_q -nobatch -global -allusers'' Shows the state of all jobs of all users in the cluster.
-    * ''condor_q -goodput -global -allusers'' Shows the state and occupied CPU of all jobs in the cluster.
-    * ''condor_q -allusers -global -analyze'' Shows detailed information for every job in the cluster.
-    * ''condor_q -global -allusers -hold'' Shows why jobs are in the hold state.
-  * ''condor_status'' Shows the state of all available resources.
-  * ''condor_status -long'' Shows the state of all available resources along with much additional information.
-  * ''condor_status -debug'' Shows the state of all available resources and some additional information, e.g. //WARNING: Saw slow DNS query, which may impact entire system: getaddrinfo(<Computername>) took 11.083566 seconds.//
-  * ''condor_rm'' Removes jobs from a queue:
-    * ''condor_rm -all'' Removes all jobs from a queue.
-    * ''condor_rm <cluster>.<id>'' Removes the job with id <id> in cluster <cluster>. (It seems ''<cluster>.'' can be omitted, and ''id'' is the ''JOB_IDS'' number.)
-  * ''condor_release -all'' Releases all jobs that are in the hold state so that they can run again.
-  * ''condor_restart'' Restarts all HTCondor daemons/services, e.g. after changes to the config file.
-  * ''condor_version'' Returns the version number of HTCondor.
-  * ''condor_store_cred query'' Returns information about the credentials stored for HTCondor jobs.
-  * ''condor_history'' Lists the recently submitted jobs. If the status ''ST'' of a specific job ''ID'' has the value ''C'', this job has run to completion (''C'').
-  * ''condor_status -master'' Returns the name, HTCondor version, CPU and memory of the central manager.
-  * Open the Command Prompt ''cmd.exe'' as Administrator and type in ''net start condor''. This has the same effect as restarting your computer, i.e. the ''condor'' service is started. This is useful if you have changed your local ''condor_config'' file.
- 
-==== Configuration options for the Central Manager computer ==== 
-With the following option in the ''condor_config'' file on the central manager, one can set a policy so that jobs are spread out over several machines rather than filling all slots of one computer before filling the slots of the other computers.
-<code>
-##------nn: SPREAD JOBS BREADTH-FIRST OVER SERVERS
-##-- Jobs are "spread out" as much as possible,
-##   so that each machine is running the fewest number of jobs.
-NEGOTIATOR_PRE_JOB_RANK = isUndefined(RemoteOwner) * (- SlotId)
-</code>
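To see why this expression spreads jobs out, the following Python sketch evaluates the same arithmetic on a toy pool. It is only an illustration, not how the negotiator is implemented. The machine names ''pc-a'' and ''pc-b'', the ''Slot'' class and the helper functions are hypothetical. The sketch assumes that only unclaimed slots compete and that the slot with the highest rank wins, and it ignores every other ranking and preemption setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    machine: str
    slot_id: int
    remote_owner: Optional[str] = None  # None models an undefined RemoteOwner


def pre_job_rank(slot: Slot) -> int:
    """Mirror isUndefined(RemoteOwner) * (- SlotId) for one slot."""
    return int(slot.remote_owner is None) * (-slot.slot_id)


def assign_jobs(slots: list[Slot], owners: list[str]) -> dict[str, int]:
    """Give each job the unclaimed slot with the highest rank; count jobs per machine."""
    jobs_per_machine = {s.machine: 0 for s in slots}
    for owner in owners:
        free = [s for s in slots if s.remote_owner is None]
        if not free:
            break  # no unclaimed slot left in this toy pool
        best = max(free, key=pre_job_rank)
        best.remote_owner = owner
        jobs_per_machine[best.machine] += 1
    return jobs_per_machine


# Two hypothetical machines with four slots each, four jobs to place.
pool = [Slot(m, i) for m in ("pc-a", "pc-b") for i in range(1, 5)]
result = assign_jobs(pool, ["user"] * 4)
print(result)  # {'pc-a': 2, 'pc-b': 2}
```

Because slot 1 of every machine ranks above slot 2 of any machine, the four jobs end up two per machine instead of filling ''pc-a'' first.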
- 
-==== FAQ ==== 
-**Q**: I submitted a job to HTCondor, but nothing happens. nextnanomat says "transmitted".
-
-**A**: It could be that nextnanomat has not read in all required settings. Try typing ''condor_restart'' in the command line. Please make sure that you have entered your credentials using ''condor_store_cred add -debug''. Then start nextnanomat again.
- 
-**Q**: I submitted a job to HTCondor, but the Batch line of nextnanomat is stuck with ''preparing''. What is wrong?
-
-**A1**: Did you store your credentials after the installation of HTCondor? If not, enter ''condor_store_cred add'' into the command prompt to add your password, see above (Recommended Installation Process).
-
-**A2**: Did you change your password recently? If yes, you have to re-enter your credentials for HTCondor.
-Enter ''condor_store_cred add'' into the command prompt to add your password, see above (Recommended Installation Process). If this does not work, try ''condor_store_cred add -debug'' to get more output information on the error.
- 
-**Q**: I specified target machines in Tools - Options. Afterwards, every job submitted to HTCondor is stuck with ''transmitting''. What is wrong?
-
-**A**: The value of ''UID_DOMAIN'' in the ''condor_config'' file needs to be the same on every computer of your cluster. (You can easily check this in a command prompt with ''condor_status -af uiddomain''.) If the values differ, no matching computer will be found and the job will not be transmitted successfully.
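The ''UID_DOMAIN'' comparison can also be scripted. The Python sketch below assumes that ''condor_status -af uiddomain'' prints one value per line. The function names and the sample output are hypothetical. ''query_uid_domains()'' only works on a machine where HTCondor is installed and on the PATH, so the demonstration runs on a hard-coded sample instead.

```python
import subprocess


def distinct_uid_domains(status_output: str) -> set[str]:
    """Collect the distinct non-empty values, one value expected per line."""
    return {line.strip() for line in status_output.splitlines() if line.strip()}


def query_uid_domains() -> str:
    """Run the command quoted in the answer above; requires HTCondor on the PATH."""
    completed = subprocess.run(
        ["condor_status", "-af", "uiddomain"],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


# Hypothetical sample output in which one machine deviates from the others.
sample = "contoso.com\ncontoso.com\nother.example\n"
domains = distinct_uid_domains(sample)
print("consistent" if len(domains) == 1 else f"mismatch: {sorted(domains)}")
```

On a real pool, pass ''query_uid_domains()'' to ''distinct_uid_domains()'' instead of the sample. More than one distinct value points to the misconfiguration described in the answer.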
- 
- 
-==== Problems with HTCondor ==== 
-=== Error: communication error === 
-If you receive the following error when you type in ''condor_status''
-<code>
-C:\Users\"<your user name>">condor_status
-Error: communication error
-CEDAR:6001:Failed to connect to <123.456.789.123>
-</code>
-you can check whether the computer associated with this IP address is your HTCondor computer using the following command.
-<code>
-nslookup 123.456.789.123
-</code>
-It is also a good idea to type in
-<code>
-nslookup
-</code>
-This will return the name of the Default Server that resolves DNS names.
-If it is not the expected computer, you can open a Command Prompt as **Administrator** and type in ''ipconfig /flushdns'' to flush the DNS Resolver Cache.
-<code>
-C:\Users\"<your user name>">ipconfig /flushdns
-</code>
-If the DNS address cannot be resolved correctly, this could be related to a VPN connection that has configured a different default server for mapping domain names to IP addresses.
-For example, your Windows domain may be called contoso.com (which is only visible within your own network and your own HTCondor pool), while your DNS resolves to www.contoso.com (which might be outside your local HTCondor pool).
- 
- 
-=== Error: ''condor_store_cred add'' failed with ''Operation failed. Make sure your ALLOW_WRITE setting include this host.'' ===
-Solution:
-Edit the ''condor_config'' file and add the host, i.e. the local computer name (here: nn-delta), to ''ALLOW_WRITE''.
-<code>
-    ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS)
-==> ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), nn-delta
-</code>
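If the same change has to be made on several computers, the edit can be expressed as a small text transformation. This Python sketch is not part of HTCondor or nextnanomat, and the function name is hypothetical. It assumes that ''ALLOW_WRITE'' is defined on a single line of the form ''ALLOW_WRITE = a, b''. It only operates on a string, so reading and writing the actual ''condor_config'' file is left to the caller.

```python
def add_allow_write_host(config_text: str, host: str) -> str:
    """Append host to the ALLOW_WRITE line unless it is already listed."""
    updated = []
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        # Only touch an active "ALLOW_WRITE = ..." line; commented lines are kept as-is.
        if sep and key.strip() == "ALLOW_WRITE":
            hosts = [h.strip() for h in value.split(",") if h.strip()]
            if host not in hosts:
                hosts.append(host)
            line = "ALLOW_WRITE = " + ", ".join(hosts)
        updated.append(line)
    return "\n".join(updated)


before = "ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS)"
after = add_allow_write_host(before, "nn-delta")
print(after)  # ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), nn-delta
```

Calling the function twice with the same host leaves the line unchanged. After the file has been edited, the HTCondor daemons still need to pick up the change, e.g. with ''condor_restart'' as listed among the commands above.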
- 
-=== Error? Check the Log files === 
-If you encounter any strange errors, you can find some hints in the history or log files generated by HTCondor.
-You can find them here:
-
-''C:\condor\spool''
-  * history
-
-''C:\condor\log''
-  * CollectorLog 
-  * MasterLog 
-  * MatchLog 
-  * NegotiatorLog 
-  * ProcLog 
-  * SchedLog 
-  * ShadowLog 
-  * SharedPortLog 
-  * StarterLog 
-  * StartLog 
-More details can be found here: [[https://htcondor.readthedocs.io/en/v8_9_3/misc-concepts/logging.html|Logging in HTCondor]]
- 
- 
  
nnm/cloud_computing.1704295010.txt.gz · Last modified: 2024/01/03 16:16 by stefan.birner