I suffer from having too many open tabs.
I think I should start dumping links into webpages like these and begin trying to treat my computers more like the desktops of old. People used to shut down their systems entirely after every use. I have a friend who actually still does this with his very modern laptop.
Sleep modes and desktop-session restoration on my daily-driver operating systems have allowed me to slip into a state of rot with respect to the number of open tabs I have.
By forcing myself to go through them and take action, I should in theory be able to have more focus and intention when I approach my computer.
I think one of my problems now is that there is so much I could do that it's hard to pick the right thing at any given time, so often enough I just end up doing nothing at all.
I tried to access one of my server pages only to find it completely unresponsive.
I cannot tell what caused the dramatic spike in load that began early this morning; the climb was gradual but substantial.
I was unable to run commands as basic as `docker ps`; something was locking up the computer.
I am unsure of the root cause (perhaps it was NFS, that's the hypothesis), but what follows are the things I encountered in restoring service.
For reference, my services on this server are:

- `littlelink` servers (testing) serving up links to my socials / projects
- `apache2` file-server for files I want to access from anywhere, sort of the start of a "global directory for Michael" but really for now just a convenient way to share things with friends
- `jupyterhub` so that I can access compute from anywhere (minimal resources but enough for basic data processing), and so that my friend can have a stable Python environment for occasional work-automation tasks (a recent development, but this sort of use-case is what I originally deployed this for)
- `photoprism` mounted on `s3fs`-mounted DigitalOcean buckets. Need to migrate this to locally hosted at home
- `mariadb` for `photoprism`
- `postgres` for `jupyterhub`
- `gitea` (`git.mlden.com`) for public-facing GitHub stuff
- `mailserver` (`mlden.com`, `clfx.cc`) email server, which I would very much like to secure through Cloudflare eventually as well

Of these, `nginx`, the `littlelink`(s) and `apache2` failed to restart, and `photoprism` was inaccessible because I forgot to migrate it to Cloudflare after my recent `clfx.cc` DNS changes.
The `littlelink` servers had an `unless-stopped` restart policy, so I'm not sure why they didn't come back up, but I manually restarted them. I thought about changing the policy to `always`, but I want to see if they come back up on their own the next time I restart Docker (they did; hopefully on reboot they will too).
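If I want to double-check what policy a container actually ended up with, or flip it later without editing the compose file, something like this should do it (the container name below is a placeholder):

```sh
# Check the restart policy a running container actually has
docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' littlelink

# If I do decide to switch, the policy can be changed in place
docker update --restart=always littlelink
```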
It may have been the NFS server I configured that was at fault, with DNS issues at home leading to failing outbound `autossh` tunnel connections that register the NFS drive at home as a local one for the server.
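For context, the tunnel in question is roughly the following shape; the host, user, and forwarded port here are placeholders rather than my actual setup.

```sh
# Outbound reverse tunnel from the home machine, exposing the home NFS port
# on the droplet's loopback so the droplet can mount it as if it were local.
autossh -M 0 -f -N \
  -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
  -o "ExitOnForwardFailure yes" \
  -R 2049:localhost:2049 mm@cloud.example.com
```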
I managed to reboot the server using the DigitalOcean console but then it failed to turn back on.
What happened was the `/etc/fstab` file was trying and failing to mount the NFS drive (despite the line being commented out with a `#`), so I had to boot from a recovery ISO, edit the file by adding another `#` character in front of the NFS-related line, and then was able to boot from the hard drive and start diagnosing my server.
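For the record, the recovery-ISO fix amounted to something like the following; the device name, export path, and mount point are placeholders rather than my actual values, and the `nofail` note is just an idea for later.

```sh
# From the recovery ISO: mount the droplet's root filesystem and fully
# comment out the NFS entry before rebooting from the disk again.
mount /dev/vda1 /mnt
sed -i 's|^.*home-nfs:/export/share|#&|' /mnt/etc/fstab   # prepend another '#' to that line
grep -n 'nfs' /mnt/etc/fstab                              # confirm it is commented out
# Longer term, adding the 'nofail' mount option to that line would keep a
# dead NFS export from blocking boot entirely.
```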
The restart was very bad though, as many services did not come back alive properly (most, including the most important one... `nginx`). But at least I had a responsive shell, working `ssh` access from my laptop, and `docker ps` was working again.
Soon enough I was able to diagnose the problem: I had left the reconfiguration of some services being migrated to Cloudflare unfinished, and as a result `nginx` failed to restart. I registered `px.clfx.cc` and `files.clfx.cc`, pointed them to the new IP addresses, and brought those services back online through Cloudflare.
I ended up using the docker-supplied IP address from `docker inspect <container_name> | grep IPAdd` in the configuration and pointing to the internal container ports, despite binding to external ports as well (I was debugging).
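As an aside, a slightly cleaner way to pull just the address is Docker's Go-template output instead of grepping; the container name below is a placeholder:

```sh
# Print only the container's IP on whatever network it is attached to
docker inspect -f '{{ range .NetworkSettings.Networks }}{{ .IPAddress }}{{ end }}' photoprism
```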
However, in trying to bring up my Apache web server on `8080` (to bind to `localhost:8080` in Cloudflare), I found another unexpected bug...
For some reason port `8080` was suddenly occupied when I expected it to be free as well. I got this error:
ERROR: for webserver Cannot start service server: driver failed programming external connectivity on endpoint webserver (<container_id>): Bind for 0.0.0.0:8080 failed: port is already allocated
Running `doas netstat -tulpn | grep 8080` yielded nothing.
I also tried this:
mm@cloud:~/apache-server$ doas ss -lptn 'sport = :80'
doas (mm@cloud) password:
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 4096 0.0.0.0:80 0.0.0.0:* users:(("docker-proxy",pid=33534,fd=4))
LISTEN 0 4096 [::]:80 [::]:* users:(("docker-proxy",pid=33540,fd=4))
mm@cloud:~/apache-server$ doas ss -lptn 'sport = :8080'
doas (mm@cloud) password:
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
I also found `whatportis` (`pip install whatportis`), which lists the "common usage" of the ports.
Neat but not helpful.
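In hindsight, one more check that would have been worth running at this point is asking Docker itself whether any container claims the port, since the allocation lives in Docker's own bookkeeping rather than only in a listening socket (this is an extra diagnostic, not something I actually ran at the time):

```sh
# List any container that publishes host port 8080
docker ps --filter "publish=8080"
```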
The mystery persists...
< some time later >
Since I could not find a service running on the port, I figured it couldn't hurt to probe it and see what responded. `curl localhost:8080` led to a "Connection Refused" message, but `curl localhost:80` showed me my expected homepage.
That gives me a hypothesis: perhaps `nginx` is installed on the host and, upon failing to start on port 80 because a Docker container was bound to it, defaulted to `8080` as an alternative HTTP port. That hypothesis didn't hold up, though: `curl localhost:<other port>` also shows "Connection Refused", and neither `nginx` nor `apache2` is installed on the computer.
I got tired of trying to start `apache2` via `docker-compose.yml` and changed my debugging strategy:
docker run --rm -ti -p 8080:8080 python:3.9 python -m http.server 8080
And that worked... So my problem with the `apache2` container must be something else.
The bind I am using is `8080:80` from the `docker-compose` file.
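A quick way to test whether that specific mapping is the problem, independently of my compose file, would be a throwaway container with the same `8080:80` publish; the image tag and container name here are just for illustration, not what my compose file actually uses.

```sh
# Throwaway Apache container with the same host:container port mapping
docker run --rm -d --name webserver-test -p 8080:80 httpd:2.4
curl -s localhost:8080   # should return Apache's default "It works!" page
docker stop webserver-test
```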
I tried my debugging command on the same network with the `--network=file-server` flag.
I don't know what worked, but suddenly I could bind to `8080` again... At some point I did restart `docker` and did try to kill processes in `htop` that were found by filtering for the port number.
I wish I had thought of the better debugging step earlier, but oh well, everything seems to be working now.
I made sure the `webserver` (`apache2`) service would not have a restart policy.
I still want to go through the logs to figure out what was bogging down the server; checking the DigitalOcean dashboard reveals that things are looking okay now.
One thing I did disable was the `rsync` job from my laptop to the server.
I want to keep that disabled and sync journal entries manually for the time being while keeping an eye on how the droplet continues to perform.
Some to-do's
- set up `overleaf` on the home server
- migrate `photoprism` to the home server
- set up Cloudflare access via `nginx` (with `htpasswd -c .htpasswd <username>` to create credentials and `-v` to verify; see the sketch after this list)
- have `nginx` direct to local services
- do the above for the services in `mlden.com` after migrating DNS to Cloudflare.
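Here is the sketch referenced in the to-do above: the credentials file path and username are placeholders, and the `nginx` wiring is shown only as comments since that config isn't written yet.

```sh
# Create the credentials file (-c only the first time; it overwrites an existing file)
htpasswd -c /etc/nginx/.htpasswd michael
# Verify a stored password
htpasswd -v /etc/nginx/.htpasswd michael
# In the relevant nginx server/location block, the file is then referenced with:
#   auth_basic "restricted";
#   auth_basic_user_file /etc/nginx/.htpasswd;
```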
In looking at my log files from `gitea` (referenced below), I found that `fenics` and `super` stopped syncing because of expired GitHub tokens. Later that day I updated the page with the new token I generated (I was reminded when I came across the still-open GitHub tab while trying to clean up my browser at the end of the evening).
If I recall correctly, I have been keeping logs on server access attempts on my `io` server for quite some time.
I should save this data and do some analysis on it to see if the migration to Cloudflare has resulted in a change in attacks.
Unfortunately, since the server restarted recently, I am not sure whether I actually still have the history of attempts.
`docker logs gitea` reveals that timestamps are incomplete for SSH attempts. The fact that timestamps show up for some interactions means that I can at least get an inconsistent time series of observations, with data going back to 2022-05-06. I captured the log today and hosted it on my server here.
I can see that I do have daily data (at least one observation per day) for the two-week period by looking at the output of:
cat .gitea-05-20.log | grep 2022/05/ | awk '{ print $1 }'
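To actually confirm the at-least-one-per-day coverage, counting lines per date is a small extension of the same pipeline:

```sh
# Count gitea log lines per day for May 2022
grep '2022/05/' .gitea-05-20.log | awk '{ print $1 }' | sort | uniq -c
```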
I do observe attacks as recent as yesterday, so perhaps I need to close port 22 in my router configuration (apparently I already did, though I forgot to close my secondary SSH port, which was rendered useless anyway). That reminds me that I will now not be able to access my studio server without a VPN, or without memorizing its IP address and re-opening the other SSH port (it is probably better to leave it closed to the world entirely).
If my studio server's IP address is no longer in DNS records, how is it that I can still receive attacks?
On my cloud server I have logs going back three months earlier, and I have not yet migrated that server to Cloudflare. I went ahead and saved this file and uploaded it here.
I can definitely develop the code on the smaller dataset and then apply it to my cloud server, and if I make the change on June 2, I would have exactly four months of data prior to the changepoint of switching to Cloudflare.
I loved my first experiments with pyscript (mud research), but didn't get around to handling interactions between the DOM and Python. I finally looked it up and have saved an example from this Stack Overflow post along with an interactive to-do list that relies on a Python file to embed the code into more deeply nested HTML than the basic example I put together (it also looks like styling is possible!).