(in reverse chronological order)
It really isn’t much fun working with Windows workstations. I sometimes ask myself what makes an operating system that is so ill thought through so successful. I have some theories, but that is not the point of this post. I want to talk about a technology that kept me up for five hours on a two-hour job: NTFS Junction Points.
Even though NTFS supports symlinks, their support was limited at first, so Junction Points were introduced in NTFS 3.0 as well. Symlinks gained their full capability in NTFS 3.1, which shipped with Windows XP but was only made usable in Windows Vista. Junction Points are like symlinks, except that they are always absolute (they cannot be relative) and they must point to a destination on the same filesystem. Furthermore, their support in applications used for copying files is extremely limited, including in Microsoft’s newest recursive copying program, Robocopy.
As it happens, user profiles contain a bunch of Junction Points (e.g. Local Settings is a Junction Point pointing to AppData\Local). One of them is even defined recursively: AppData\Local\AppData points back to its own parent, AppData\Local. All of this comes together to form the following disaster:
If you are not aware of Junction Points and you attempt to copy a user profile, even with robocopy’s flag for copying symlinks as symlinks set, the Junction Points’ contents will be copied. That means that much of the user profile will be duplicated, or even triplicated, in different locations. That will break over time, because applications accessing folders through the paths that were Junction Points and those accessing the target folders directly will see different content as it is updated. But more immediate is the effect of the Junction Point at AppData\Local\AppData. It causes an infinite copy loop, resulting in a destination profile containing something along the lines of AppData\Local\AppData\AppData\AppData\AppData\.... If this is allowed to run for a while before being noticed, the structure will be so deep that Windows’ file-delete algorithm can no longer process the file names, because their paths are too long. That’s just wrong, though: one file utility should not be able to produce paths that another can’t process. Where’s the consistency? Removing the mess requires an iterative process: set ownership of the AppData folder to yourself, navigate as deeply as possible, move the AppData subfolder out, erase the original AppData folder, and repeat the process with the AppData folder that was moved out. This is so tedious in Windows that it can create upwards of an hour of work. Which, in fact, it did for me yesterday.
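The unnesting procedure can be sketched programmatically. This is only an illustration of the idea, not what I actually did by hand in Explorer; the `max_depth` parameter stands in for Windows’ path-length ceiling, and the folder name is an assumption-free parameter:

```python
import os
import shutil

def unnest_and_delete(root: str, name: str = "AppData", max_depth: int = 50) -> None:
    """Delete a pathologically deep chain of `name` folders under `root`
    by repeatedly moving the deepest reachable part out beside the root
    copy and deleting the (now shallow enough) remainder."""
    part = 0
    target = os.path.join(root, name)
    while os.path.isdir(target):
        # navigate as deeply as the tooling can follow
        path = target
        depth = 0
        while depth < max_depth and os.path.isdir(os.path.join(path, name)):
            path = os.path.join(path, name)
            depth += 1
        deeper = os.path.join(path, name)
        if os.path.isdir(deeper):
            # move the still-deeper rest out, erase the chain we can
            # reach, then repeat the process on the moved-out folder
            part += 1
            moved = os.path.join(root, f"{name}.part{part}")
            os.rename(deeper, moved)
            shutil.rmtree(target)
            os.rename(moved, target)
        else:
            shutil.rmtree(target)  # short enough to delete in one go
```

Each round shortens the chain by `max_depth + 1` levels, which is exactly the take-ownership/move/delete/repeat cycle described above.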
In the end, I was able to tell robocopy to ignore all Junction Points (/xj), and then I recreated them by hand at the destination:

robocopy src dst /e /copyall /dcopy:dat /w:0 /sl /xj
mklink /J "Local Settings" AppData\Local
Once the IPSec/IKEv2 VPN was up and running, I had just one change to make to the user profiles of all the users in our Active-Directory Domain: make the targets of the users’ home directories, as well as the targets of the connected network drives, fully qualified (glasgow -> glasgow.ec-ws.de). The fully qualified name resolves to an internal IP address that is routable via the VPN; the short form only resolves when connected directly to the company network’s DHCP server.
The procedure was supposedly simple:
1. The user logs off his machine.
2. I programmatically replace all instances of \\glasgow\ with \\glasgow.ec-ws.de\ in that user’s registry hive.
3. The user logs back in.
4. The user happily continues with his work.
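Step 2 boils down to a string substitution over every value in the user’s hive. The hive walking itself (via a registry export/import or the Win32 registry API) is omitted here; this hypothetical helper only shows the core rewrite, with host and domain as parameters:

```python
def qualify_unc_paths(value: str, host: str = "glasgow",
                      domain: str = "ec-ws.de") -> str:
    """Replace short-form UNC references (\\glasgow\...) with fully
    qualified ones (\\glasgow.ec-ws.de\...). Already-qualified paths
    are left untouched, because the short form is followed by a
    backslash, not a dot."""
    short = "\\\\" + host + "\\"
    full = "\\\\" + host + "." + domain + "\\"
    return value.replace(short, full)

# e.g. a typical home-directory value
print(qualify_unc_paths(r"\\glasgow\users\paul"))  # \\glasgow.ec-ws.de\users\paul
```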
Well, number 4 didn’t work out as planned. For three users (including myself), their network-based home directories were torn apart. Mine simply became empty, another’s was partially removed, and yet another’s was partially removed and contained additional garbage. The rest of the users’ home directories appeared to have suffered no ill effects. I was able to restore the contents of the affected home directories from the previous night’s backup, so no great harm was done. But it left me severely nervous that such an innocent action could have such devastating effects on a completely unrelated part of the system! That’s what life must be like as a Windows sysadmin: you’re constantly afraid to touch anything, because it might break something else in unexpected, unpredictable, and catastrophic ways. Yuck!
So I warned all the users about the situation, suggesting that they perform a simple check to make sure that their home directory was OK. A few minutes later, my boss, who had started his computer just to perform this check, called me up and told me that he couldn’t log in to his computer: it had been hanging on the “Welcome” screen for between 5 and 10 minutes. “What now,” I thought. “I finally got everything fixed.” OK. Off I went; I wanted to have a personal look at that stuck welcome screen. Indeed, I had seen that computer take two minutes to log the user in before, but never ten. I couldn’t hear or see any hard drive activity. I tried to nudge Windows into shutting down by pushing the power button, but it went into standby mode instead. When I woke it up, it had finished the login process. Hurray! Maybe it just had some idiotic process to go through before allowing the user to log in after I adjusted those paths, and is now OK! Please?
This started the worst of it. Logging out and in again without turning the computer off resulted in a login time of just over a minute. Logging out, shutting down the computer, turning it back on, and logging in again gave me a login time of (I measured it) nine minutes and a few seconds! It was doing the same thing all over again! I just hoped that it would figure itself out if I left it running overnight, so I disabled automatic standby, and my boss agreed to leave the machine at the office overnight. My gut feeling is that this system was completely re-sorting its Offline Files, and that it was doing so on every login. I just hope that it reaches a stable state and can log in again without all of this extra work.
Later on, I got a message from a coworker who had by now gone home and tried to access one of her files on her laptop via Offline Files, only to be greeted with the message that the server was not available. Happily, she was able to connect to the VPN, but that just showed her an obsolete state, because her computer hadn’t managed to synchronize her new files to the server via the Offline Files mechanism while still at the office (today, yesterday, who knows how much is missing). Yuck, yuck, yuck! These Offline Files that Microsoft invented are giving me real grief! It feels like trying to eat mold.
Here is my conclusion:
In my previous post, Why Active Directory Is More Trouble than it’s Worth, I had already complained about Offline Files. But now, I believe that most of the grief I’ve had with this particular Active Directory comes from that very well-intentioned and supremely dysfunctional technology.
For a stable network environment with central user management compatible with Windows machines and a central data store for both network shares and user directories, I envision the following: No. Offline. Files!
Here are the key configuration items:
Local User Profiles
Local User Home Directories
Periodic one-way synchronization of the user data (profiles and home directories) to the file server, managed by the file server (SMB mounts with rsync, plus a discovery mechanism to determine which system holds the currently used copy of the user data).
When logging in to a different machine, the User Data can be manually retrieved. There is no automatic synchronization to workstations.
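The synchronization item above can be illustrated with a sketch. In production this would be rsync pulling from an SMB mount of the workstation, as described; this Python stand-in only demonstrates the one-way property (nothing is ever written back to the source), and unlike rsync --delete it does not handle deletions:

```python
import os
import shutil

def one_way_sync(src: str, dst: str) -> None:
    """Mirror src into dst, server side pulling from a mount of the
    workstation's profile. Data only flows one way; a failed run can
    never corrupt the live user data in src."""
    for dirpath, dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        out = os.path.join(dst, rel)
        os.makedirs(out, exist_ok=True)
        for fn in filenames:
            s = os.path.join(dirpath, fn)
            d = os.path.join(out, fn)
            # copy only new or updated files, as rsync would
            if not os.path.exists(d) or os.path.getmtime(s) > os.path.getmtime(d):
                shutil.copy2(s, d)
```

The crucial design choice is the direction: because the file server drives the pull, there is no logout-time race on the workstation and no two-way conflict resolution to go wrong.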
This brings several important advantages:
The workstation behaves more like users expect. The files in their Desktop, Documents, Pictures, etc. folders are stored locally and are always available; they do not depend on the availability of any network services.
The users instinctively know what is on the file server (and what they won’t have access to when disconnected) because they access it through mapped drives or the Network Neighborhood.
Logins and logouts are fast, because no user profile has to be read from or written to the file server each time. This is especially apparent for large profiles (our largest is 3.7 GB at the moment); that’s a lot of data movement every time my boss logs in.
Safe shutdowns. When shutting down with Roaming Profiles, it often happens that the network drivers are unloaded before the profile has been completely copied back to the file server. That causes more headaches than a crying baby that wakes you up five times every night! With local profiles, there is no data to write out before the network hardware is deactivated.
Up-to-date file server state. Keeping the user data local eliminates the need for Offline Files, whose primary (and automatically enabled) job is making networked home directories available on the machine while disconnected from the network (see Why Active Directory is More Trouble than it’s Worth for this point, too). Without Offline Files, there is no two-way synchronization process to fail unnoticed; reads and writes are always performed directly against the file server’s shares.
And some inconvenient, but generally less important, disadvantages:
Users can’t just log in to another machine without manually setting up their user data on that machine.
Users can’t configure folders on the network shares to be available and synchronized automatically. They either have to copy the data to their computers before they leave the network, or they need access to it via the VPN. This point is mitigated almost entirely by the fact that Offline Files is so unreliable that most people I have observed make copies onto USB sticks anyway, because they don’t know what will and what won’t be available when they’re disconnected.
All in all, keeping the synchronization algorithm known as Offline Files out of my network seems to be an important simplification of the system that should greatly improve stability.
While debugging the IPSec/IKEv2 VPN mentioned below, I had to deal with full-sized response packets (size == MTU) not being delivered to the client. I checked all tables of iptables on the router to ensure that the packet was indeed leaving the router destined for the external (NAT) address behind which the client would receive it. Doing so, I verified that the router properly reported a reduced MTU of 1438 bytes when forwarding packets over the tunnel. However, starting at a size of 1391 bytes, the packets never arrived at the client. Therefore, I am assuming that some router on the way to the client added some more headers (probably for NAT) but didn’t send back an ICMP “Fragmentation Needed” or ICMPv6 “Packet Too Big” message. That puts me in the situation of dealing with a PMTU Black Hole Router that is outside of my control. Here is my solution:
Configure the VPN router to apply a conservative TCP MSS policy to all packets destined for the VPN.
root@leeds:~# iptables -t mangle -A FORWARD -i eth1 -s 172.25.0.0/16 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1350
root@leeds:~# iptables -t mangle -A FORWARD -o eth1 -d 172.25.0.0/16 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1350
root@leeds:~# iptables-save -t mangle
# Generated by iptables-save v1.4.21 on Wed Dec 21 17:41:35 2016
*mangle
#* (manually added for vim's syntax highlighting)
:PREROUTING ACCEPT [15292:8687444]
:INPUT ACCEPT [4972:1063701]
:FORWARD ACCEPT [10279:7619033]
:OUTPUT ACCEPT [4808:1177850]
:POSTROUTING ACCEPT [15087:8796883]
-A FORWARD -s 172.25.0.0/16 -i eth1 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1350
-A FORWARD -d 172.25.0.0/16 -o eth1 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1350
COMMIT
# Completed on Wed Dec 21 17:41:35 2016
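One plausible reading of the 1350 above: the largest packet that still made it through was 1390 bytes, and subtracting the IPv4 and TCP headers (20 bytes each, assuming no options) yields the MSS. A quick sanity check of that arithmetic:

```python
def clamped_mss(largest_delivered: int, ip_header: int = 20,
                tcp_header: int = 20) -> int:
    """Largest TCP payload that fits in a packet the black-hole path
    still delivers (header sizes assume no IP/TCP options)."""
    return largest_delivered - ip_header - tcp_header

# packets of 1391 bytes and up were silently dropped, so 1390 is the
# usable maximum on this path
print(clamped_mss(1390))  # 1350
```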
Alternatively, I could have used the policy module to match IPSec packets, but I chose to match on the destination network for consistency with my other rules. The alternative would have been as follows:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m policy --pol ipsec --dir in -j TCPMSS --set-mss 1350
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m policy --pol ipsec --dir out -j TCPMSS --set-mss 1350
It should be noted that this only takes care of TCP connections. Other protocols are completely on their own. That especially puts the commonly used UDP and ICMP at risk.
It would also be possible to configure the clients to handle these situations themselves, but that is too far out of my control and requires non-standard configurations on too many devices. For completeness’ sake, though, this method would work as follows:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\EnablePMTUBHDetect = 1
This parameter is optional and non-existent by default and needs to be added. It can be disabled by setting it to 0 or removing it.
On Linux clients, this may or may not help:
echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc
As part of my course of study, I am implementing an IPSec/IKEv2 VPN for the eCommerce Werkstatt. Doing that, I have encountered a number of difficulties that required some research to overcome. One of them was that I could ping resources in the network from a connected client and receive an answer, but I couldn’t ping the connected client from within the network: the packet never reached the client, and consequently it never sent back an answer.
The problem was that my SNAT rules in
iptables were interfering with the
relevant IPSec routing policies in the kernel’s Security Policy Database (SPD).
These can be seen with the following command:
root@leeds:~# ip xfrm policy
src 172.25.5.251/32 dst 172.25.0.0/16
        dir fwd priority 1859
        tmpl src 220.127.116.11 dst 18.104.22.168
                proto esp reqid 24 mode tunnel
src 172.25.5.251/32 dst 172.25.0.0/16
        dir in priority 1859
        tmpl src 22.214.171.124 dst 126.96.36.199
                proto esp reqid 24 mode tunnel
src 172.25.0.0/16 dst 172.25.5.251/32
        dir out priority 1859
        tmpl src 188.8.131.52 dst 184.108.40.206
                proto esp reqid 24 mode tunnel
The interference was that my SNAT rules rewrote the source address of all packets destined to leave on the external interface of the router to its external address, including those that should go through the IPSec tunnel. As a consequence, the policy that should catch those packets, src 172.25.0.0/16 dst 172.25.5.251/32, no longer matched (the source IP was wrong; for details, see the netfilter graph). Once I had seen all of that, the solution was simple: add an exception to the SNAT (and DNAT) chains:
root@leeds:~# iptables -t nat -I snat -d 172.25.0.0/16 -j RETURN
root@leeds:~# iptables -t nat -I dnat -s 172.25.0.0/16 -j RETURN
root@leeds:~# iptables-save -t nat
# Generated by iptables-save v1.4.21 on Wed Dec 21 14:05:34 2016
*nat
#* (manually added for vim's syntax highlighting)
:PREROUTING ACCEPT [207947:16230768]
:INPUT ACCEPT [54254:3941136]
:OUTPUT ACCEPT [77049:5382062]
:POSTROUTING ACCEPT [53957:4927218]
:dnat - [0:0]
:snat - [0:0]
-A PREROUTING -i eth1 -j dnat
-A POSTROUTING -o eth1 -j snat
-A dnat -s 172.25.0.0/16 -j RETURN
-A dnat -p tcp -m tcp --dport 5060:5061 -j DNAT --to-destination 172.25.1.7
-A dnat -p tcp -m tcp --dport 22001 -j DNAT --to-destination 172.25.1.17:22
-A dnat -p udp -m udp --dport 5060:5061 -j DNAT --to-destination 172.25.1.7
-A dnat -p udp -m udp --dport 16384:32767 -j DNAT --to-destination 172.25.1.7
-A snat -d 172.25.0.0/16 -j RETURN
-A snat -s 172.25.1.7/32 -p tcp -m tcp --sport 5060:5061 -j SNAT --to-source 220.127.116.11:5060-5061
-A snat -s 172.25.1.7/32 -p udp -m udp --sport 5060:5061 -j SNAT --to-source 18.104.22.168:5060-5061
-A snat -s 172.25.1.7/32 -p udp -m udp --sport 16384:32767 -j SNAT --to-source 22.214.171.124:16384-32767
-A snat -s 172.25.1.17/32 -p tcp -m tcp --sport 22 -j SNAT --to-source 126.96.36.199:22001
-A snat -p tcp -m tcp --sport 1:511 -j SNAT --to-source 188.8.131.52:1-511
-A snat -p tcp -m tcp --sport 512:1023 -j SNAT --to-source 184.108.40.206:512-1023
-A snat -p tcp -m tcp --sport 1024:5059 -j SNAT --to-source 220.127.116.11:1024-5059
-A snat -p tcp -m tcp --sport 5060:65535 -j SNAT --to-source 18.104.22.168:5062-65535
-A snat -p udp -m udp --sport 1:511 -j SNAT --to-source 22.214.171.124:1-511
-A snat -p udp -m udp --sport 512:1023 -j SNAT --to-source 126.96.36.199:512-1023
-A snat -p udp -m udp --sport 1024:5059 -j SNAT --to-source 188.8.131.52:1024-5059
-A snat -p udp -m udp --sport 5060:16383 -j SNAT --to-source 184.108.40.206:5062-16383
-A snat -p udp -m udp --sport 16384:65535 -j SNAT --to-source 220.127.116.11:32768-65535
-A snat -j SNAT --to-source 18.104.22.168
COMMIT
# Completed on Wed Dec 21 14:05:34 2016
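The mismatch itself can be modeled in a few lines. This is only an illustration of the CIDR match against the dir out policy shown above, not the kernel’s actual SPD lookup, and the post-SNAT source address is a stand-in from a documentation range, since the router’s real external IP isn’t shown here:

```python
import ipaddress

# the "dir out" policy: src 172.25.0.0/16 dst 172.25.5.251/32
POLICY_SRC = ipaddress.ip_network("172.25.0.0/16")
POLICY_DST = ipaddress.ip_network("172.25.5.251/32")

def tunnel_policy_matches(src: str, dst: str) -> bool:
    return (ipaddress.ip_address(src) in POLICY_SRC
            and ipaddress.ip_address(dst) in POLICY_DST)

# before SNAT: a reply from an internal host matches and gets tunneled
print(tunnel_policy_matches("172.25.1.7", "172.25.5.251"))   # True
# after SNAT rewrites the source to the router's external address,
# the policy no longer matches and the packet bypasses the tunnel
print(tunnel_policy_matches("203.0.113.1", "172.25.5.251"))  # False
```

This is why the fix is a RETURN rule that exempts tunnel-bound traffic from SNAT before the rewrite happens.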
Of course, I made the changes permanent in my firewall setup configuration (iptables-persistent). And what do you know: packets are now properly transmitted in both directions.
Next challenge: why does PMTUD not work for outgoing packets with sizes above 1390 and up through 1422 bytes?
Several problems kept cropping up during our use of the Active-Directory Windows Domain here at the eCommerce Werkstatt:
The settings defined in the GPO would not be applied by a few workstations.
When settings were changed in GPO they would often not be applied by many workstations.
Keeping the Roaming part of profiles on network shares broke a bunch of programs, including Microsoft Office, even though that is precisely why AppData is split up into Roaming, Local, and LocalLow, and the official Microsoft documentation explicitly suggests redirecting that folder.
The firefox folder in Roaming Profiles kept breaking sporadically for Paul (and perhaps for Anne).
Offline Files would not reliably synchronize all files. Anne often found files missing.
Offline Files would report conflicts and errors that should not have happened.
Yesterday I fought with our workstation that goes by the name of belfast, trying to set up an e-mail account in Outlook. In the process, I discovered that the GPO settings for the user were not being applied. I started reading up on similar issues other people were having and discovered that, when shutting down, Windows does not close locks held on network files before closing network sockets 1 2. That means that if users shut down the machine without going through the extra step of logging out explicitly beforehand, they will encounter problems if they log in again before the server’s lock timeout is reached (10-15 minutes). In a way, this was the last straw for me. I have now come up with a list of ways in which Windows implemented, and continues to implement, Domain Administration badly:
GPO settings are not guaranteed to arrive on a workstation. It is up to the workstation to pick them up and apply them. If, for any reason, it doesn’t want to, or has trouble doing so, that’s it: no settings. This is horribly flawed, as I want some assurance that my settings reach their destination.
GPO pickup by workstations is pretty unstable and can error out for any number of reasons. These errors are often not reported at all (not in the Events Log and not with a popup message). In the few cases that they are reported, the error message does not mention what went wrong, nor does it say what part of the configuration caused the error. This leaves me, the administrator, playing a long and arduous guessing game at where to start resolving the issue. Trying a hundred things before accidentally fixing the error leaves me not knowing what was wrong and it also leaves me changing configurations back and forth that further destabilize the environment. Not to mention, it wastes days of my time.
Administration utilities like certutil are documented poorly, and in some respects just plain wrongly. Removing certificates that are expired on a certain date should work with certutil -deleterow <date> Cert. The documentation, certutil -?, doesn’t mention whether the command applies to all certificates prior to that date or just the ones that contain that date. The wording sounds more like it would only apply to certificates with that date, but online HowTos 3 4 5 6 7 8 state that it applies to all certificates expired before that date. It also doesn’t mention whether the expiration date or the issue date is compared; the same HowTos speak of the expiration date being the one compared, which makes the most sense. On my systems (Windows 10), the command always comes back with an error message stating that the corresponding file could not be found, regardless of whether I choose a date after or on the certificate’s date of expiration. The HowTos mention another way to use certutil, but that method gives me the same error. Furthermore, the documentation makes no mention of the format for <date>. In fact, it must be entered in the localized format (10/26/2016 in the US, 26.10.2016 in Germany, 2016-10-26 in Denmark, etc.), which leads to a lot of confusion 9. It seems that certutil is partially broken in Windows 10, without any care taken to update the documentation. The same documentation is also written so superficially as to be of almost no help to the administrator wanting to apply the tool.
This kind of gives me the impression that Windows administration utilities are developed, or at least maintained, with a lack of consideration for the administrator, me. I don’t feel flattered!
Now, if I am to rid this network, and all future networks that I may have the pleasure of administering, of the broken, pesky, and time-consuming Windows Domain, I will need solutions for the following tasks:
How can I apply settings to a Windows workstation without using GPO? (MSI files, Zabbix agent)
How can I enable working security policies and reliable file access without consistent SIDs and UIDs between machines? (username/password based file access)
How can I make a user’s profile be available automatically when logging in to a new machine? (logoff and logon scripts that perform a differential sync with a central profile store)
How can I make a user’s profile available for backup? (same as previous)
How can I have software be automatically installed?
These solutions will need to be developed, and I hope to be able to post them here as I progress.
I started to work at the eCommerce Werkstatt back in November of 2013. What I found there was a Windows server running an Active-Directory Windows Domain for a bunch of Windows workstations. There was a second server running Linux that was used for file backups and a few internal websites.
The whole thing was unreliable. The backups failed without notice. The backup (performed by the Windows Server Backup service) kept growing with each new diff, with no way to get rid of old ones. Roaming Profiles would regularly time out during logout synchronization, so that when users logged in the next day, they would get back old files that they had moved or removed, and lose files that they had added.
So I decided I wanted a maintainable server and most of the Roaming Profiles’ contents moved out to network shares, to reduce my headaches and problems with this Windows Domain. I set up a series of virtual servers using Samba4 (redundant domain controllers and a file server). I also configured the GPO to use Folder Redirection liberally: Desktop, Documents, etc. were all redirected to a home directory provided for each user on the file server. Initially I had AppData/Roaming redirected there as well, but I changed that later on (see my next post).
The Samba4-based Active-Directory Windows Domain helped me with the following tasks and goals:
GPO enabled a central configuration of Windows workstations.
Domain Users could be configured centrally for all workstations, including Linux workstations.
Domain Users enabled access to network files based on globally valid security settings without needing any further login credentials.
Domain Users enabled consistent user mappings for file ownership.
Roaming Profiles enabled users to sign in to any workstation in the domain.
Backups were much more reliable and space efficient operating from a Linux file server.
Folder Redirection made the Roaming Profiles smaller and prevented timeouts from occurring during logout.