Date: 2023-05-10, Version: 1.1
What's actually inside the Austrian internet?
How much? How big? What exactly? Where exactly? By whom?
In honor of the Austrian Staatsfeiertag, I collected 1.3 million .at domains, searched for their corresponding IP addresses, scanned important network ports, made DNS queries, retrieved home pages, and saved HTML, HTTP headers, and cookies. In total, 56 gigabytes of data were gathered, and analyzed extensively afterward. Here's what I found.
This site contains the results and an overview of the Austrian internet (only .at domains), as well as the sources and methods that can be used to obtain information from public data. If you want to learn more about future customers, potential partners, and persistent competitors, you often just need to know where and how to search for them on the internet.
Wikipedia: The phrase "Tu felix Austria" is said to describe Austrians as having a particularly happy disposition or way of life. It was first used, probably by Duke Rudolf IV, on his seals in 1364...
Which came first, the chicken or the egg? Where is the best place to start? There are several possible starting points. The .at domains seemed the most promising to me. It would also have been possible to use Austrian IP addresses, but this would have led to much poorer results, as will be seen in the IP address section.
What about Austrian websites that use .com, .net, .org, and other generic top-level domains? It's true that only a fraction was examined here. Nevertheless, .at domains are very popular and used by many companies and individuals. Finding out which other domains belong to Austria is a project for another day.
Too long; didn't read! For those who don't have time to read everything, here's a brief overview of the sections and key findings from each area.
Content: What domains exist in Austria? How to find domains? What domains are important and what can be learned from domains?
Content: What are IP addresses? How to find IP addresses? Where are the locations of the addresses and which IP is responsible for which domains?
Content: What are network ports? How to scan them? Which IP addresses have which ports open and what insights can be gained from this?
Content: What DNS entries are there and what can be learned from them? Which domains can even work? Which technologies (mail providers, service desks, marketing tags, ...) and cloud providers are often used? Which mail providers are the largest?
Content: What are HTTP headers? What can be read from headers? What are commonly used headers? How often are web pages updated? Who else uses cookies?
Content: What is HTML? What can be found in HTML? How big is the HTML? Which tags are often used? Which content is included externally? How SEO-optimized are the pages? Who has included microformats? Evaluations of inbound/outbound links. Who uses images and which ones?
Content: Who has an imprint? What languages are used on websites?
It all starts with a domain. An address that can be entered into a browser and, in the best case, leads us to a fast, beautiful, well-made website with great content. A complete domain is called a Fully Qualified Domain Name (FQDN).
Web addresses, also called URLs (Uniform Resource Locators), start with the protocol (e.g. http://), followed by zero, one, or more subdomains (e.g. www), then the domain name and the Top-Level Domain (TLD), each separated by dots. The domain name is usually a Second-Level Domain (SLD).
The country-specific Top-Level Domain (ccTLD) for Austria is "at". Not every website hosted in Austria or targeting people in Austria needs to have a .at domain. However, many do. 1.5 million, according to official nic.at statistics from April 2023.
Protocol | Subdomain | Domain (Second-Level Domain) | Top-Level Domain |
---|---|---|---|
https:// | | orf | at |
https:// | tvthek | orf | at |
nic.at is the company responsible for the allocation and management of .at domains. In addition to the .at TLD, .co.at and .or.at domains can be registered. The "co" stands for "Commercial" and "or" for "Organization". These SLDs are the equivalent of the international models .com and .org. However, the two SLDs are not nearly as popular as .at domains. According to nic.at statistics, 33,000 .co.at domains have been registered so far, and only 8,000 .or.at domains.
There are several other Second-Level Domains that are not directly managed by nic.at. The .gv.at domain is intended for government agencies and is managed by the Federal Ministry of Finance (see GV.AT domain management). For academic institutions, there is the .ac.at domain, which is assigned or managed by the University of Vienna (see ACOnet).
Protocol | Subdomain | Domain name | Public Suffix + TLD |
---|---|---|---|
https:// | | help | gv.at |
https:// | | augustin | or.at |
https:// | www | univie | ac.at |
In addition, there are SLDs operated by private companies or associations. These domains are often treated as TLDs in browsers (e.g. for cookies and the address bar), which is why the non-profit organization Mozilla maintains a list where operators can register.
Currently, there are 19 SLDs for .at registered in the Mozilla Public Suffix List:
The following analyses refer to the official five TLD/SLDs (at, or.at, co.at, gv.at, ac.at). Analyses for other SLDs can be found in specific sections (e.g. priv.at).
With the GitHub package crwlrsoft/url from crwl.io, you can easily parse domains or URLs, taking into account the public suffixes and supporting IDNs.
# composer require crwlr/url
require 'vendor/autoload.php';
use Crwlr\Url\Url;
$url = Url::parse('https://www.domain.gv.at');
// Dump the public suffix of the host
var_dump($url->domainSuffix());
There is no publicly available comprehensive list of all .at domains like there is for other TLDs (Sweden, Switzerland, Centralized Zone Data Service). Therefore, one must create a comprehensive list from various sources. There are both commercial providers and open-source projects for domain lists. Additionally, I used various methods to find additional domains in search results and public log files.
For .ac.at and .gv.at, I requested a comprehensive list from the responsible authorities. These requests were rejected on the grounds of data protection. Whose data is being protected there, anyway? Never mind.
A good source for new domains is the "Certificate Transparency Logs". CT Logs can be publicly viewed and are actually intended to bring more transparency to the issuance of security certificates. Since almost all websites are now delivered over HTTPS, each of these websites needs a certificate, which can then be found in these logs.
Source | at | co.at | or.at | gv.at | ac.at |
---|---|---|---|---|---|
nic.at Statistics [1] | 1,458,656 | 33,773 | 7,841 | unknown | unknown |
domainsproject.org | 954,141 | 13,761 | 2,928 | 1,239 | 1,261 |
ViewDNS.info | 1,187,203 | 28,454 | 6,492 | 1,689 | 1,397 |
domains-monitor.com | 462,553 | 8,094 | 1,717 | 1,023 | 1,054 |
staedtebund.gv.at [2] | | | | 2,106 | |
Bing Web Search API [3] | | | | 134 | 109 |
Short Domain Checker [4] | 19,697 | 402 | 176 | | |
Cert Transparency (CT) Log Monitoring [5] | 18,659 | 294 | 69 | 55 | 70 |
DMOZ Export (2016) [6] | 17,884 | 426 | 141 | 228 | 174 |
Sources:
By combining all sources, one can obtain numbers that are quite close to the public statistics provided by nic.at. Achieving 100% accuracy is hardly possible, as new domains are constantly being registered, and many remain unknown if nothing is done with them.
Anyway, I work with what I have.
1,317,549 .at domains.
TLD/SLD | Domains |
---|---|
at | 1,277,059 |
co.at | 29,458 |
or.at | 6,770 |
gv.at | 2,794 |
ac.at | 1,468 |
Total | 1,317,549 |
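Merging several source lists and counting the result per suffix can be sketched in a few lines of PHP. This is only an illustration, assuming PHP 8 and plain arrays of domain names; the function names `suffixOf` and `countBySuffix` and the sample lists are hypothetical and not part of the original tooling. Other suffixes such as priv.at are simply counted as "at" here.

```php
<?php
// The five suffixes analysed in this article. Longer suffixes come first
// so that "example.co.at" is not counted as a plain ".at" domain.
const SUFFIXES = ['co.at', 'or.at', 'gv.at', 'ac.at', 'at'];

// Returns the matching suffix or null if the domain is outside the five.
function suffixOf(string $domain): ?string
{
    $domain = strtolower(rtrim(trim($domain), '.'));
    foreach (SUFFIXES as $suffix) {
        if (str_ends_with($domain, '.' . $suffix)) {
            return $suffix;
        }
    }
    return null;
}

// Merges any number of domain lists, removes duplicates, counts per suffix.
function countBySuffix(array ...$sources): array
{
    $unique = array_unique(array_map('strtolower', array_merge(...$sources)));
    $counts = array_fill_keys(SUFFIXES, 0);
    foreach ($unique as $domain) {
        $suffix = suffixOf($domain);
        if ($suffix !== null) {
            $counts[$suffix]++;
        }
    }
    return $counts;
}

// Hypothetical sample input, not taken from the real dataset
$listA = ['orf.at', 'example.co.at', 'ORF.at'];
$listB = ['univie.ac.at', 'help.gv.at', 'orf.at', 'example.com'];
print_r(countBySuffix($listA, $listB));
```

Checking the longer suffixes first is the only design decision worth noting: with the order reversed, every domain would match "at".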
The longest domains I have found are 63 characters long. This is most likely due to the label length restriction of 63 characters between two dots. Otherwise, people would certainly pack even more keywords into the domain or do something silly with it. I found seven .at domains with 63 characters.
The zzzz... domain also exists for ac.at and would therefore be the longest domain if you add the five characters of the TLD. This domain appears to be a test domain of the University of Vienna and currently does not serve any web content (HTTP/HTTPS).
Side note: I asked ChatGPT (3.5-turbo) about the longest known .at domain, and the AI was pretty sure that it had to be donau-dampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft.at. However, I cannot verify or understand this answer. This domain contradicts the rules of nic.at (see nic.at Registration Guidelines) and the standard (RFC) and should never have existed.
The new version GPT-4 gives the correct answer that the longest domain is 63 characters long. However, it does not know any specific domain of this length. AI still can't solve all our problems 😜
Since 2016, the minimum length of a .at domain has been exactly one character (letter or number). Before 2016, domains had to be at least three characters long. Currently, all .at domains with one or two characters are taken. However, there are still thousands of such domains available for .co.at or .or.at.
Length | TLD | Available | Examples |
---|---|---|---|
1 | .at | 0 | |
1 | .co.at | 0 | |
1 | .or.at | 6 | q.or.at, 4.or.at, 7.or.at |
2 | .at | 0 | |
2 | .co.at | 928 | 00.co.at, 11.co.at, gg.co.at |
2 | .or.at | 1,148 | zz.or.at, yy.or.at, kk.or.at |
"Drei hob i gsogt!": The shortest currently available .at domains are three characters long. There are still more than 30,000 available. So, about 40 % of all 3-character .at domains are registered. Some examples of available domains are: 003.at, 00a.at, k8n.at, zuc.at, 8-j.at.
3 Chars .at-Domains
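The share of registered 3-character domains can be checked with a little arithmetic. The sketch below assumes ASCII labels only (a-z, 0-9, and a hyphen that may not be first or last) and ignores IDN characters, which would enlarge the name space. The function name `possibleLabels` is hypothetical, and the 30,000 is the lower bound quoted above, not an exact count.

```php
<?php
// Number of possible ASCII labels of a given length:
// 36 choices (a-z, 0-9) at both ends, 37 (plus hyphen) in between.
function possibleLabels(int $length): int
{
    if ($length < 1) {
        return 0;
    }
    if ($length === 1) {
        return 36;
    }
    return 36 * 36 * (37 ** ($length - 2));
}

$total = possibleLabels(3);   // 36 * 37 * 36
$available = 30000;           // lower bound from the text, not an exact figure
$registeredShare = ($total - $available) / $total * 100;

printf("%d possible, at most %.1f %% registered\n", $total, $registeredShare);
```

With 47,952 possible names and more than 30,000 still free, the registered share lands in the high thirties, which is in line with the rough 40 % mentioned above.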
A domain is available if nic.at says it is. However, bulk Whois queries are not possible at nic.at, as your own IP will be blocked relatively quickly.
To check a large number of domains, you could use resellers. However, I made DNS queries and found that all available domains returned the following answer ...
% dig j-8.at +noadditional +noquestion +nocomments +nocmd +nostats SOA
at. 10562 IN SOA dns.nic.at. domain-admin.univie.ac.at. 1680930002 10800 3600 604800 10800
* Domains with "pendingDelete" status also return this answer. However, there seem to be very few of these, and you can only find out the status by making a Whois query.
A registered domain has NS entries and a different SOA entry.
% dig nic.at +noadditional +noquestion +nocomments +nocmd +nostats SOA
nic.at. 905 IN SOA ns1.nic.at. domain-admin.univie.ac.at. 2023043280 3600 1800 1209600 900
nic.at. 900 IN NS ns1.nic.at.
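The SOA comparison can be automated once the dig output is available as text. The following is a minimal sketch, assuming PHP 8 and output in the format shown above; `looksUnregistered` is a hypothetical helper name. It only makes sense for registrable names (not subdomains), and as noted, domains in "pendingDelete" status produce the same answer as free ones.

```php
<?php
// Returns true when the SOA line belongs to a parent zone (e.g. "at.")
// instead of the queried domain, false when it belongs to the domain
// itself, and null when the output contains no SOA line at all.
function looksUnregistered(string $domain, string $digOutput): ?bool
{
    $domain = strtolower(rtrim($domain, '.')) . '.';
    foreach (preg_split('/\R/', trim($digOutput)) as $line) {
        $fields = preg_split('/\s+/', trim($line));
        // Expected fields: owner, TTL, class, type, data...
        if (count($fields) >= 4 && strtoupper($fields[3]) === 'SOA') {
            return strtolower($fields[0]) !== $domain;
        }
    }
    return null;
}

// Usage with the sample answer from above
$answer = 'at. 10562 IN SOA dns.nic.at. domain-admin.univie.ac.at. 1680930002 10800 3600 604800 10800';
var_dump(looksUnregistered('j-8.at', $answer));

// To query live, the output could be fetched like this (needs dig installed):
// $answer = shell_exec('dig ' . escapeshellarg('j-8.at')
//     . ' +noadditional +noquestion +nocomments +nocmd +nostats SOA');
```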
The length distribution is a classic positively skewed bell curve. More than 50 % of the found domain names are between 7 and 14 characters long. Only just under 5,000 domains are longer than 30 characters.
.at-Domains - Length distribution
Of course, we all want to know what such long domains can look like. Here are some examples that are currently registered but do not have any content.
🤷
Internationalized Domain Names (IDNs) are domains that contain characters other than a-z, 0-9, or the hyphen, that is, non-ASCII characters. German umlauts (ä, ö, ü) are a typical example, which is why IDNs are often referred to as umlaut domains. Such IDNs have only been available since 2004, and according to nic.at statistics as of April 2023, just under 36,000 have been registered.
IDNs are typically stored and processed as Punycode. Punycode is an encoding system in which special characters are encoded using ASCII characters and appended to the end of the domain. Users don't typically see Punycode in their daily use of the internet because the conversion happens automatically in the background by browsers and servers.
Punycode domains start with "xn--" and end with a hyphen followed by a short sequence of ASCII characters that encodes the special characters and their positions. For example, "österreich.at" becomes "xn--sterreich-z7a.at".
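In PHP, the conversion is available through the intl extension. The sketch below assumes PHP 8 with intl installed and a UTF-8 encoded source file; the wrapper names `toPunycode` and `isIdn` are hypothetical, while `idn_to_ascii()` is the built-in function.

```php
<?php
// Converts a Unicode domain to its Punycode (ASCII) form.
// Returns null when the intl extension is missing or the name is invalid.
function toPunycode(string $domain): ?string
{
    if (!function_exists('idn_to_ascii')) {
        return null;
    }
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ascii === false ? null : $ascii;
}

// A stored domain is an IDN if any of its labels carries the "xn--" prefix.
function isIdn(string $asciiDomain): bool
{
    foreach (explode('.', strtolower($asciiDomain)) as $label) {
        if (str_starts_with($label, 'xn--')) {
            return true;
        }
    }
    return false;
}

var_dump(toPunycode('österreich.at'));
var_dump(isIdn('xn--sterreich-z7a.at'));
```

Because IDNs are stored as Punycode, the prefix check alone is enough to count them in a domain list; no decoding is needed for that.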
There are still some short umlaut domains available: ö.or.at, ü.or.at, ä.or.at, ü.co.at, ä.co.at. "ß" (sharp s) is not supported in .at domains.
The dataset used contains 2.1 % IDNs.
% IDN Domains
IDNs can be used for malicious purposes. By using certain special characters, users can be deceived, so I was curious which special characters other than German umlauts are used in .at domains.
I found 159 .at IDNs that contain characters other than umlauts, such as ç, ë, ó or é. At first glance, there is nothing exciting about them. Perhaps interesting are ímmowelt.at and ìmmowelt.at, which redirect to immowelt.at. Here, the company has obviously taken proactive measures to prevent others from registering such misleading domains.
There are several TOP lists from different providers. The most well-known is probably the Alexa Ranking list (which has nothing to do with Amazon's AI assistant). However, Amazon has bought this company and the Alexa website is no longer accessible. Currently, the domain list is still available. This will surely change soon.
The ranking methods differ somewhat. In general, some form of PageRank algorithm is applied. This algorithm evaluates the popularity of a website by counting referring pages (inbound links): the more other pages link to a page, the higher it is in the list.
Other TOP lists use traffic as a measure, which is measured by tracking pixels (ÖWA) or browser extensions (Netcraft).
The results are, let's say, "okay". As you can see in the table below, there are some incorrect and some questionable entries. For example, Majestic ranks gv.at and or.at, which are effectively TLDs rather than websites, in the TOP 10. Then there are some pages on the list that I have never heard of. The ÖWA only includes sites that voluntarily want to be measured (and pay for it).
"Tranco" is a project that was launched by several scientists and aggregates multiple sources and mathematically weights them. They call it "A Research-Oriented Top Sites Ranking Hardened Against Manipulation"... Okay, you decide!
Place* | Majestic [1] | Alexa | Cisco [2] | Netcraft [3] | Tranco [4] | ÖWA** [5] | Similarweb [6] | Overall* |
---|---|---|---|---|---|---|---|---|
1 (10) | gv.at | orf.at | ad4m.at | orf.at | univie.ac.at | orf.at | orf.at | orf.at (59) |
2 (9) | google.at | derstandard.at | google.at | bergfex.at | orf.at | willhaben.at | google.at | google.at (41) |
3 (8) | kriesi.at | krone.at | optadata.at | willhaben.at | google.at | krone.at | krone.at | willhaben.at (36) |
4 (7) | univie.ac.at | willhaben.at | gmx.at | harryahamer.at | derstandard.at | heute.at | willhaben.at | derstandard.at (33) |
5 (6) | orf.at | google.at | waust.at | geizhals.at | kriesi.at | derstandard.at | heute.at | krone.at (28) |
6 (5) | shorturl.at | hurawatch.at | willhaben.at | bawag.at | shorturl.at | meinbezirk.at | derstandard.at | univie.ac.at (19) |
7 (4) | or.at | shorturl.at | orf.at | karriere.at | krone.at | kurier.at | oe24.at | heute.at (16) |
8 (3) | derstandard.at | heute.at | derstandard.at | raiffeisen.at | world4you.at | gmx.at | gmx.at | kriesi.at, shorturl.at (14) |
9 (2) | tuwien.ac.at | toon.at | interspar.at | univie.ac.at | tuwien.ac.at | oe24.at | kleinezeitung.at | |
10 (1) | wko.at | oe24.at | post.at | c3w.at | wien.gv.at | kleinezeitung.at | bergfex.at | gmx.at (13) |
* For the last column, I tallied the results of the table (unscientifically): each placement earns the points shown in brackets in the first column, so the more often and the higher up a page appears in the table, the higher its score (in brackets) and its placement in the column.
** The ÖWA data is from February 2023 and sorted by unique users per individual offering.
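The tally behind the last column can be reproduced with a short script. This is a minimal sketch, assuming that place 1 earns 10 points down to 1 point for place 10, which matches the bracketed values in the first column; the function name `overallScores` is hypothetical and the lists are copied from the table above.

```php
<?php
// Each list is ordered from place 1 to place 10 (values from the table above).
$lists = [
    'Majestic'   => ['gv.at', 'google.at', 'kriesi.at', 'univie.ac.at', 'orf.at', 'shorturl.at', 'or.at', 'derstandard.at', 'tuwien.ac.at', 'wko.at'],
    'Alexa'      => ['orf.at', 'derstandard.at', 'krone.at', 'willhaben.at', 'google.at', 'hurawatch.at', 'shorturl.at', 'heute.at', 'toon.at', 'oe24.at'],
    'Cisco'      => ['ad4m.at', 'google.at', 'optadata.at', 'gmx.at', 'waust.at', 'willhaben.at', 'orf.at', 'derstandard.at', 'interspar.at', 'post.at'],
    'Netcraft'   => ['orf.at', 'bergfex.at', 'willhaben.at', 'harryahamer.at', 'geizhals.at', 'bawag.at', 'karriere.at', 'raiffeisen.at', 'univie.ac.at', 'c3w.at'],
    'Tranco'     => ['univie.ac.at', 'orf.at', 'google.at', 'derstandard.at', 'kriesi.at', 'shorturl.at', 'krone.at', 'world4you.at', 'tuwien.ac.at', 'wien.gv.at'],
    'ÖWA'        => ['orf.at', 'willhaben.at', 'krone.at', 'heute.at', 'derstandard.at', 'meinbezirk.at', 'kurier.at', 'gmx.at', 'oe24.at', 'kleinezeitung.at'],
    'Similarweb' => ['orf.at', 'google.at', 'krone.at', 'willhaben.at', 'heute.at', 'derstandard.at', 'oe24.at', 'gmx.at', 'kleinezeitung.at', 'bergfex.at'],
];

// Sums the points per domain over all lists and sorts by score, best first.
function overallScores(array $lists, int $places = 10): array
{
    $scores = [];
    foreach ($lists as $ranking) {
        foreach (array_values($ranking) as $index => $domain) {
            if ($index >= $places) {
                break;
            }
            // Place 1 earns $places points, place 2 one point less, and so on
            $scores[$domain] = ($scores[$domain] ?? 0) + ($places - $index);
        }
    }
    arsort($scores);
    return $scores;
}

print_r(array_slice(overallScores($lists), 0, 10, true));
```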
Sources:
Domains may contain hyphens, but not at the beginning or end. This makes domains longer but more readable. When domains are divided by hyphens, it provides a good overview of the individual words or parts of speech used in the domain.
There are about 460,000 domains that contain hyphens, and dividing them creates a list of 157,000 different words. I used the top 10,000 keywords with more than 3 characters to search for them in other domains. This helps to find keywords that appear in domains without a dividing hyphen.
Keywords in domains (>3 chars)
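The two steps described above, splitting at hyphens and then searching for the resulting keywords in other domains, could look roughly like this. It is a sketch under assumptions: PHP 8, domains given without subdomains, and Punycode labels skipped because their hyphens are not word separators. The function names `keywordCounts` and `domainsContaining` are hypothetical.

```php
<?php
// Splits the first label of each hyphenated domain and counts the parts.
// The default of 4 implements "more than 3 characters".
function keywordCounts(array $domains, int $minLength = 4): array
{
    $counts = [];
    foreach ($domains as $domain) {
        $label = explode('.', strtolower($domain))[0]; // part before the first dot
        if (!str_contains($label, '-') || str_starts_with($label, 'xn--')) {
            continue;
        }
        foreach (explode('-', $label) as $word) {
            if (strlen($word) >= $minLength) {
                $counts[$word] = ($counts[$word] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent keywords first
    return $counts;
}

// Lists, per keyword, the domains whose first label contains it.
function domainsContaining(array $domains, array $keywords): array
{
    $hits = [];
    foreach ($domains as $domain) {
        $label = explode('.', strtolower($domain))[0];
        foreach ($keywords as $keyword) {
            if (str_contains($label, (string) $keyword)) {
                $hits[$keyword][] = $domain;
            }
        }
    }
    return $hits;
}

// Hypothetical sample input
$counts = keywordCounts(['auto-haus.at', 'haus-bau.at', 'hausbau.at']);
print_r(domainsContaining(['hausbau.at', 'autohaus.at'], array_keys($counts)));
```

A plain substring search produces false positives (a short keyword can sit inside an unrelated longer word), so the results are a starting point for evaluation rather than a clean classification.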
Other parts of domains: Depending on what you're looking for, you can already deduce a lot from domain names. Here are some examples of evaluations from the domain data.
Domains with years
Top locations in domains
Health sector domains
Domains with prefix
Domains with first names
Domains from the hospitality sector
The .priv.at domains are special domains for individuals and are awarded by the VIBE!AT association (Association for Internet Users Austria).
To get a .priv.at domain, one must be a private individual and reside in Austria. The domain name can apparently be chosen quite freely, as long as it does not infringe on registered rights. Registration of a .priv.at domain is free.
The dataset contains 294 .priv.at domains. Most of them are first names, last names, or nicknames. The nic.priv.at whois form still shows the full personal data of many domain owners 🤨.
An IP address is a series of numbers that uniquely identifies a device on a network. This allows devices to communicate with each other. IPv4 addresses are still the most common, and anyone who has ever set up an internet router at home probably knows what IP addresses look like. Something like: 192.168.0.1
Websites also have an IP address, or rather, the servers on which these websites are hosted. If you trace the .at domains back to the IP addresses, you will see that 205,572 domains (15.6 percent) are currently not assigned to any IP address.
% Domains with/without IP
Not every website needs exactly one unique IP address. It is possible for many domains to point to one IP address, and the server to deliver all these websites. The address could also be a load balancer, behind which a whole network is hidden. This network can in turn be responsible for many domains.
Find A records with dig
% dig +noall +answer orf.at A
orf.at. 21481 IN A 194.232.104.140
orf.at. 21481 IN A 194.232.104.139
...
orf.at. 21481 IN A 194.232.104.149
Find A records with nslookup
% nslookup orf.at
Server: 192.168.50.1
Address: 192.168.50.1#53
Non-authoritative answer:
Name: orf.at
Address: 194.232.104.3
...
Name: orf.at
Address: 194.232.104.141
Find IP address with ping
% ping orf.at
PING orf.at (194.232.104.3): 56 data bytes
64 bytes from 194.232.104.3: icmp_seq=0 ttl=54 time=25.222 ms
The 1.3 million .at domains resolve to 112,162 different IP addresses, which works out to roughly 10 domains per IP on average. This average is driven by a few IPs that are responsible for many thousands of domains. These IPs belong to large hosting companies (e.g., World4You, Host Europe, ...) or domain parking services (e.g., Sedo).
Domain count per IP address
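Counting domains per IP is a simple inversion of the domain-to-IP mapping. The sketch below assumes the A records are already available as an array of domain => list of IPs; the function name `domainsPerIp` is hypothetical. The average at the end uses the figures from the text and only counts domains that resolve to an IP at all.

```php
<?php
// $resolved maps each domain to the list of IPs from its A records.
// Returns the number of domains per IP, busiest IPs first.
function domainsPerIp(array $resolved): array
{
    $perIp = [];
    foreach ($resolved as $ips) {
        foreach (array_unique($ips) as $ip) {
            $perIp[$ip] = ($perIp[$ip] ?? 0) + 1;
        }
    }
    arsort($perIp);
    return $perIp;
}

// Figures from the text: all domains minus those without an IP
$withIp = 1317549 - 205572;
printf("%.1f domains per IP on average\n", $withIp / 112162);
```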
The leader here is the IP address 81.19.154.98 from World4You, which is responsible for 27,802 .at domain names. If you call up the IP or the respective websites, you can quickly see why this is the case. Either you will be redirected to another domain immediately (Domain Redirect Service) or you will end up on a domain parking page.
How do you actually know who is responsible for an IP? To find this out, there are several ways: Some information can be seen in the DNS PTR record. This is essentially the counterpart to the A-record and establishes the reverse link from IP to domain name. The request is made with an IP and you get a hostname in return.
% dig +noadditional +noquestion +nocomments +nocmd +nostats -x 142.251.37.3
3.37.251.142.in-addr.arpa. 26701 IN PTR muc11s23-in-f3.1e100.net.
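The name that is queried for a PTR lookup is just the IP address with its four parts reversed, plus the suffix in-addr.arpa. A minimal sketch for IPv4, with the hypothetical helper name `reverseName`:

```php
<?php
// Builds the in-addr.arpa name used for a reverse (PTR) lookup.
// Returns null if the input is not a valid IPv4 address.
function reverseName(string $ipv4): ?string
{
    if (filter_var($ipv4, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    return implode('.', array_reverse(explode('.', $ipv4))) . '.in-addr.arpa';
}

echo reverseName('142.251.37.3'), "\n";

// PHP's built-in gethostbyaddr() performs the actual lookup (needs network):
// echo gethostbyaddr('142.251.37.3'), "\n";
```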
The easy and fast way is to use an IP/Geo-Location database, like the Maxmind GeoLite DB. Such databases contain many IP addresses and the associated organizations that can be identified via ASN (Autonomous System Number).
# composer require geoip2/geoip2
require 'vendor/autoload.php';
use GeoIp2\Database\Reader;
// Confused? The "City" DB also contains countries
$reader = new Reader('/usr/local/share/GeoIP/GeoIP2-City.mmdb');
$record = $reader->city('128.101.101.101');
var_dump($record->country->name);
The top 10 responsible organizations by ASN (Autonomous System Number) are mostly located in Germany and are large hosting companies like Hetzner, Host Europe, IONOS, or even Amazon. A special case is Cloudflare, which operates a large international network in the field of security and performance and already includes 6,832 IP addresses (related to .at domains).
IPs by ASN (Organisation)
When looking at the number of IPs and domains on /24 IP blocks, you can see that some hosters distribute their domains across multiple blocks and IPs, while others host a large number of domains on a single block with very few IPs.
An IPv4 /24 block contains the 256 addresses that share the same first three numbers. By convention, the last position runs from 1 to 254 for hosts, since 0 and 255 are reserved as network and broadcast addresses. So, for example, from 81.19.159.1 to 81.19.159.254. In the next two charts, you can see the top 10 blocks with the most domains and how many IP addresses were used within the block.
IP count with domains in /24 IP blocks
The top 3 blocks host a similar number of domains (41 to 46 thousand). IONOS distributes these domains across the entire block (254 IP addresses) and World4You uses only just over half of the IP addresses in one block but uses multiple blocks.
Domain count in /24 IP blocks
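Grouping addresses into /24 blocks only requires masking out the last number of each IP. The following sketch assumes one IPv4 address per domain for brevity; the function names `block24` and `blockStats` and the sample data are hypothetical.

```php
<?php
// Network address of the /24 block an IPv4 address belongs to.
function block24(string $ipv4): ?string
{
    $long = ip2long($ipv4);
    if ($long === false) {
        return null;
    }
    // Keep the first 24 bits, zero the last 8
    return long2ip($long & 0xFFFFFF00) . '/24';
}

// Counts domains and distinct IPs per /24 block, largest blocks first.
function blockStats(array $domainToIp): array
{
    $stats = [];
    foreach ($domainToIp as $ip) {
        $block = block24($ip);
        if ($block === null) {
            continue;
        }
        $stats[$block]['domains'] = ($stats[$block]['domains'] ?? 0) + 1;
        $stats[$block]['ips'][$ip] = true; // set of IPs seen in this block
    }
    foreach ($stats as $block => $entry) {
        $stats[$block]['ips'] = count($entry['ips']);
    }
    uasort($stats, fn($a, $b) => $b['domains'] <=> $a['domains']);
    return $stats;
}

// Hypothetical sample input
print_r(blockStats([
    'a.at' => '81.19.159.1',
    'b.at' => '81.19.159.1',
    'c.at' => '81.19.159.77',
]));
```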
What else could be done with IP blocks? In IP blocks where only some addresses are used, the question is, of course, what can be found on the other addresses? Either they are not used, or they contain systems and domains that have not yet been discovered.
Below is an anonymized example of an IP block scan for IP addresses not yet present in the dataset. A network ping is sent, and ports 80 and 443 are scanned. There would still be quite a bit to find, but I've left it at that for now.
The evaluation of the 112k IP addresses shows that the IP addresses are located in 95 different countries. The majority, namely 52,931 IPs, are in Germany. Even the combined total for Austria, the USA, France, and the Netherlands is smaller than the number of IP addresses with .at domains hosted in Germany.
21 countries are each responsible for only one or two IPs (e.g., Iraq, Georgia, Colombia, ...). In Russia, there are 191 IPs of .at domains and 47 are in China.
Top 10 IP count per country
There are some free providers of IP/GeoLocation databases or APIs that can help you find the location, country, and responsible organization of an IP address.
The location of an IP address can change from time to time. For example, when IP blocks are resold to other service providers or when a service provider uses a block in a different data center.
Ports are used to separate the communication of applications and services on a device from each other and thus ensure orderly and efficient communication over the network. Each port is identified by a unique number called a port number, which ranges from 0 to 65535.
Typically, specific port numbers are reserved for specific applications and services. From this, conclusions can be drawn about the software used or at least about the available services of the server.
To scan or to have it scanned? Port scanning is not complicated and can be done with tools like Nmap or a few lines of source code. However, port scanning is time-consuming if you want to scan many ports per domain. Fortunately, someone has already done this, and the results are available via a free API.
For simple port scanning, you don't necessarily have to work with Nmap. Whether a port is open or closed can be quickly and easily checked with PHP.
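$ipOrHost = 'www.parlament.gv.at';
$port = 25565;
$timeout = 0.2; // seconds
// The @ suppresses the warning fsockopen() emits when the connection fails
$connection = @fsockopen($ipOrHost, $port, $errorCode, $errorMessage, $timeout);
if (is_resource($connection)) {
    echo 'OPEN';
    fclose($connection);
} else {
    echo 'CLOSED or filtered';
}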
The Shodan InternetDB is designed to quickly find security vulnerabilities and, for example, monitor your own IP. If, like me, you already have an extensive list of domains and their associated IP addresses, Shodan can be used to find out which ports are open on the respective IPs.
Almost all evaluations in this section are based on Shodan data; only the ports for HTTP vs. HTTPS have been scanned by me. Lists of default ports for different services and software can be found all over the internet. See, for example, Wikipedia TCP/UDP Port Numbers or Secbot (Common Ports).
Using the API, 112,162 IP addresses were checked. Of the checked addresses, 95,717 IPs (85%) have an entry in the Shodan database.
Percentage of IPs with/without Shodan entry
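Reading the open ports from an InternetDB answer could look like the sketch below. It rests on assumptions that are not spelled out in this article: that the API is reachable at `https://internetdb.shodan.io/<ip>` and returns JSON with a `ports` array. Both function names are hypothetical. The parsing is kept separate from the HTTP call so it can be tried without network access.

```php
<?php
// Extracts the open ports from an InternetDB JSON response body.
// Returns null if the body is not the expected JSON (e.g. no entry for the IP).
function portsFromInternetDb(string $json): ?array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['ports']) || !is_array($data['ports'])) {
        return null;
    }
    return array_map('intval', $data['ports']);
}

// Fetches the entry for one IP. The endpoint URL is an assumption, see above.
function fetchInternetDb(string $ip): ?array
{
    $body = @file_get_contents('https://internetdb.shodan.io/' . rawurlencode($ip));
    return $body === false ? null : portsFromInternetDb($body);
}

// Hypothetical response body for a documentation IP address
var_dump(portsFromInternetDb('{"ip":"192.0.2.1","ports":[22,80,443]}'));
```

When checking more than a hundred thousand addresses, it would be sensible to pause between requests and cache the answers, since the service is free and shared.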
In general, one should make sure that not too many ports are open on a server, as each open port is a potential attack surface from the outside. However, if servers offer many services or act as a gateway for other servers, it can happen that 20 or more ports are open.
42 % of the IP addresses have fewer than 5 open ports, usually in a combination typical for web servers: 80, 443, and one of SSH (22), FTP (21), SMTP (25, 587), or MySQL (3306).
Count of open ports per IP
"What you see is what you get." - The evaluation resulted in a list of 1,298 different port numbers. Many of these ports are used only occasionally. The most commonly used ports are from the areas of email (IMAP, POP3, SMTP), file transfer (FTP), shell access (SSH), and, naturally, the standard ports intended for web content (80, 443).
Number of IPs with typical ports
Web content is usually delivered on ports 80 (HTTP) or 443 (HTTPS).
A few years ago, Google pushed website operators to switch to the more secure HTTPS. Many did so at the time, fearing lower rankings. Major browser manufacturers also introduced measures to encourage website operators to make the switch. Connections were marked as "insecure" or some browser features were only available for HTTPS pages.
Therefore, it is not surprising that 88 % of .at websites can be reached via HTTPS and HTTP. 7 % of the checked pages are only accessible on HTTP (Port 80), and only 3 % of the pages are configured for HTTPS-only.
Percentage of open HTTP(s) ports
It is possible for web content to be delivered on ports other than 80 and 443. The most common ports are variants of 80 and 443, such as 8080, 8000, 8081, or 9443. Port 3000 is also often used in node.js environments.
Such ports are sometimes used to operate a development or test environment alongside the actual site. The integration of 3rd-party systems into the network landscape is also achieved via alternative ports.
12 % of all checked IPs deliver web content on port 8880, 9 % on 8080, and 2 % on 8081. Upon further inspection, it becomes apparent that 8880 is almost always a Plesk login. On port 3000, you often encounter the Grafana statistics software. Other ports deliver error messages or blank pages or are subsequently blocked by a firewall (e.g., FortiGuard).
Occasionally, you'll come across login pages for other web-based applications: i-MSCP, ISPConfig, Kibana, Roundcube, ... Is this a problem? Yes, no, maybe! It depends on how secure the passwords used are and whether the respective software is regularly updated.
Note: Now it makes sense to take a short detour to the National Vulnerability Database. There, you can find current and past security vulnerabilities, searchable by software and exact version. Here's an example for "Roundcube." If you find vulnerabilities for the software and version there, then it's a problem.
In most cases, databases should not be directly connected to the internet because it provides an attack surface that can be easily avoided. In some cases, however, it is unavoidable, and then the database in use can be determined via port scanning. In the checked IPs, there are about 22,000 open database ports, and the vast majority of them, 18,091 to be exact, are MySQL databases.
All popular databases (data-stores) have defined default ports that are usually not changed. However, some databases use very generic ports (8443, 9443, or 8080) that are not considered in this evaluation because these ports would not provide meaningful clues.
Number of IPs with open DB ports
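Mapping open ports to databases is a plain table lookup. The port list below contains widely used default ports and is an assumption about what such an evaluation could cover, not the exact list used for the chart; the function name `databaseCounts` is hypothetical.

```php
<?php
// Well-known default ports. A database on a non-default port is not detected,
// and an open port is only a hint, not proof of the software behind it.
const DB_PORTS = [
    3306  => 'MySQL/MariaDB',
    5432  => 'PostgreSQL',
    1433  => 'Microsoft SQL Server',
    1521  => 'Oracle',
    27017 => 'MongoDB',
    6379  => 'Redis',
    9200  => 'Elasticsearch',
];

// $openPorts maps each IP to its list of open ports.
// Returns the number of IPs per database, most frequent first.
function databaseCounts(array $openPorts): array
{
    $counts = [];
    foreach ($openPorts as $ports) {
        foreach (array_unique($ports) as $port) {
            $name = DB_PORTS[$port] ?? null;
            if ($name !== null) {
                $counts[$name] = ($counts[$name] ?? 0) + 1;
            }
        }
    }
    arsort($counts);
    return $counts;
}

// Hypothetical sample input
print_r(databaseCounts([
    '192.0.2.1' => [80, 443, 3306],
    '192.0.2.2' => [22, 3306, 5432],
]));
```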
A web application server provides an environment in which web applications can be executed. This includes the provision of runtime environments, frameworks, database connections, security features, and other services required for the development and operation of web applications. The used web app servers are often recognizable from the outside by their default ports.
Number of IPs with evidence of app servers
We have to answer the most important question at the end of the chapter 😜. Are there IPs with open Minecraft ports in the dataset? Yes. 191 IP addresses have an open port 25565, indicating an active Minecraft server.
And can you connect? The first two servers had a different version, and for the next two, I wasn't on the player whitelist, but on the 5th server on the list, I was able to log in, even though the server and the IP address were not known as a Minecraft server anywhere on the Internet.
DNS records connect IP addresses with domain names. In simplified terms, a DNS server is a large database that contains different types of entries and is mirrored on many other servers on the Internet.
In the IP addresses section, this database was already used to determine IP addresses. However, DNS records can reveal even more interesting things. DNS entries can be read with various tools. The analyses in this section were all done with the command-line tool dig, saved, and then parsed with PHP. The dataset comprises 6,288,955 records for 888,450 domains and is 347 MB in size.
For security reasons, some name servers do not respond to so-called ANY queries, which return all records of every type. Therefore, in the first step, the 7 most important entry types (NS, TXT, A, AAAA, CNAME, MX, SOA) were individually queried.
To obtain the most comprehensive picture possible, in the second step, ANY queries were sent to all domains, and the analyses related to this are presented in the "DNS ANY queries" section.
The command-line program dig can be used to make queries to nameservers. Here, all records (ANY) for orf.at are requested from the nameserver with the IP address 8.8.8.8 (Google).
dig +noadditional +noquestion +nocomments +nocmd +nostats orf.at ANY @8.8.8.8
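Saved dig output in this format is easy to parse, because every answer line has the same five columns: name, TTL, class, type, and data. The following is a minimal sketch, not the parser actually used for the dataset; the function names `parseDigLine` and `recordsByType` are hypothetical.

```php
<?php
// Parses one answer line of dig output into its five columns.
// Returns null for empty lines, comments, and anything that does not fit.
function parseDigLine(string $line): ?array
{
    $line = trim($line);
    if ($line === '' || $line[0] === ';') {
        return null;
    }
    // Limit of 5 keeps the record data (which may contain spaces) in one piece
    $fields = preg_split('/\s+/', $line, 5);
    if (count($fields) < 5 || !ctype_digit($fields[1])) {
        return null;
    }
    [$name, $ttl, $class, $type, $data] = $fields;
    return [
        'name'  => rtrim(strtolower($name), '.'),
        'ttl'   => (int) $ttl,
        'class' => $class,
        'type'  => strtoupper($type),
        'data'  => $data,
    ];
}

// Groups all records of a dig output by record type (A, NS, MX, ...).
function recordsByType(string $digOutput): array
{
    $byType = [];
    foreach (preg_split('/\R/', $digOutput) as $line) {
        $record = parseDigLine($line);
        if ($record !== null) {
            $byType[$record['type']][] = $record;
        }
    }
    return $byType;
}

print_r(parseDigLine('orf.at. 21481 IN A 194.232.104.140'));
```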
First of all, an overview of the types and frequency of DNS records. Common types include NS, MX, A, AAAA, TXT, CNAME, and SOA records.
DNS records by type
DNS records by host/type
Each website needs an NS, A, and SOA record; otherwise, it would not work. 853,055 domains have entries of these three types. As can be seen in the chart on the left, some types of entries (e.g., NS) are assigned multiple times to increase fault tolerance.
AAAA records are IPv6 addresses. For many years, there have been warnings that the IPv4 address space is exhausted and that everyone should switch to IPv6. Apparently, that has not happened yet.
The Mail Exchange (MX) DNS records indicate which mail server is responsible for the domain. This can provide insights into the software and cloud providers used.
I would have expected most domains to have at least two or more MX records. This would increase fault tolerance and would hardly require any extra effort. As the chart on the left shows, this is not the case, and there is a reason for it.
The individual MX records point to subdomains (e.g., xyz.mail.protection.outlook.com) that have multiple IP addresses (A records) assigned to them. If we include this in the calculation, the chart (on the right) looks somewhat different.
Count MX records by domain
Count MX records by domain (Multi IPs included)
If you are building airplanes and want to make them really safe, you can simply create 12 MX records.
741,491 domains have at least one MX record and can theoretically receive emails. The top 10 mail servers include major hosting companies and cloud providers.
Domains that point to outlook.com use a Microsoft 365 (formerly Office 365) product. MX records that contain "aspmx.l.google.com" indicate a Google Workspace product.
Mail server hosts
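Recognizing the provider from an MX record comes down to matching the end of the mail server's host name. The sketch below assumes PHP 8 and only covers the two patterns mentioned above; the function name `mailProvider` is hypothetical, and a real evaluation would need a much longer list.

```php
<?php
// Host name endings from the text and the product they indicate.
const MX_PROVIDERS = [
    'outlook.com'        => 'Microsoft 365',
    'aspmx.l.google.com' => 'Google Workspace',
];

// Accepts MX record data as printed by dig, e.g. "10 mail.example.at."
// (priority followed by host), or just the host name.
function mailProvider(string $mxData): ?string
{
    $parts = preg_split('/\s+/', trim($mxData));
    $host = rtrim(strtolower(end($parts)), '.');
    foreach (MX_PROVIDERS as $needle => $provider) {
        if ($host === $needle || str_ends_with($host, '.' . $needle)) {
            return $provider;
        }
    }
    return null; // unknown or self-hosted
}

// Hypothetical sample records
var_dump(mailProvider('0 example-at.mail.protection.outlook.com.'));
var_dump(mailProvider('5 alt1.aspmx.l.google.com.'));
```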
When looking at the distribution of domains among mail servers, it becomes apparent that the top 15 are responsible for more than 5,000 domains each. After that, there are around 60 mail providers that manage emails for 500 to 5,000 domains. 7,814 smaller hosts or larger companies are responsible for 2 to 500 domains. The majority of mail servers, namely 282,377, are responsible for exactly one domain.
Domains by mail server
External vs. Internal MX domains: If the domain matches the mail server domain, it is likely that a local mail server is responsible for receiving emails. Different domains can indicate the use of a cloud provider or the hoster operating a centralized mail service.
Mail servers external/internal
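One simple way to draw the internal/external line is to check whether the MX target lies inside the domain itself. This is a sketch under that assumption; the function name `isInternalMx` is made up, and a stricter comparison would reduce both names to their registrable domain first (for example with the Public Suffix List).

```php
<?php
// Hypothetical helper: true if the MX target is the domain or one of its subdomains.
function isInternalMx(string $domain, string $mxTarget): bool
{
    $domain = rtrim(strtolower($domain), '.');
    $host = rtrim(strtolower($mxTarget), '.');
    return $host === $domain || str_ends_with($host, '.' . $domain);
}

var_dump(isInternalMx('example.at', 'mail.example.at'));                 // bool(true)
var_dump(isInternalMx('example.at', 'xyz.mail.protection.outlook.com')); // bool(false)
```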
Note: When looking through the MX list, it can be seen that around 200 domains have an MX record from the Russian cloud providers Yandex or mail.ru. The Chinese mail providers qq.com and 163.com are only used by 6 .at domains.
In TXT records, you can write anything you want. A single string inside a TXT record is limited to 255 characters; if you want to pack more text into DNS, split it across several strings or just create multiple TXT records. In my scans, I found 560,400 TXT records for 417,348 different domains.
Domain count with n TXT records
Most domains have exactly one TXT record, while some have as many as 50 records. I mean, "Sparkasse, quo vadis?" (where are you going?) I hope that's intentional.
In recent years, TXT records have been increasingly used to verify ownership of a domain. When a domain is to be used with a cloud provider, the owner is asked to create a special TXT record so that the provider knows the domain really belongs to the customer.
By the type of record, it is often possible to tell which domain is working with which SaaS provider. Providers such as Google, Facebook, Apple, and Zoho use this type of domain verification.
Verify TXT records
Based on the Verify TXT record alone, one cannot tell which specific product is being used by each provider. One can only say in general that 53k domains are using something from Microsoft and 44k are using some product from Google.
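Recognizing the provider behind a verification record usually comes down to a prefix match. The sketch below assumes a handful of prefixes that are commonly seen for the providers named above; the function name `verificationProvider` and the prefix list are illustrative and not the rule set used for the charts.

```php
<?php
// Hypothetical helper: guesses the provider behind a verification TXT value.
function verificationProvider(string $txt): ?string
{
    // Commonly seen prefixes (an assumption, deliberately incomplete).
    $prefixes = [
        'google-site-verification=' => 'Google',
        'ms=' => 'Microsoft',
        'facebook-domain-verification=' => 'Facebook',
        'apple-domain-verification=' => 'Apple',
        'zoho-verification=' => 'Zoho',
    ];
    // Strip surrounding whitespace and quotes, then compare case-insensitively.
    $value = strtolower(trim($txt, " \t\""));
    foreach ($prefixes as $prefix => $provider) {
        if (str_starts_with($value, $prefix)) {
            return $provider;
        }
    }
    return null; // not a known verification record
}

var_dump(verificationProvider('MS=ms12345678'));
var_dump(verificationProvider('v=spf1 -all'));
```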
Many other providers use TXT records to store configurations or verifications for specific products. During a brief review, I found more than 70 providers:
Amazon SES, BMD, Barracuda, Brave, Cisco, Citrix, Cloudflare, ClubDesk, DigiCert, Docker, Drift, DUO, Dynatrace, Elastic Email, Firebase, Fortinet, Freshdesk, GitHub, GitLab, GlobalSign, HIBP, Hornetsecurity, HubSpot, IBM, Indeed, Infoniqa, Jimdo, KnowBe4, MIDOCO, MS Dynamics, MS Office365, Mailjet, Mailru, Mandrill, Microsec, Mimecast, Miro, MongoDB, nameshield, Offensity, OneTrust, Oracle, Pardot, Plesk, Postman, Protonmail, Rexx, SAP, Salesforce, SendGrid, Sendinblue, Seobility, Shopify, Sipgate, Smartsheet, Sophos, Spycloud, Squarespace, Stripe, TITAN, TOPDesk, Trend Micro, Trustpilot, Webex, Webflow, Wix, WordPress, Workplace, Wrike, Yandex, Zendesk, Zoom, blackscreen, dan.com, eRecruiter, flexera, iCIMS, mailEnable, proofpoint, sevDesk, site24x7, successfactors, x-mailer
Many of the providers and products can be found in the SPF information. SPF stands for "Sender Policy Framework" and is designed to make email communication more secure. An SPF record lists all the mail servers that are allowed to send emails on behalf of the domain. Therefore, many marketing, support, and sales systems can be found in the values.
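380,332 domains are secured by the Sender Policy Framework. Another protocol designed to improve email security is DMARC, but I found it on only 752 domains. One caveat: DMARC policies are normally published as a TXT record under the _dmarc subdomain (e.g., _dmarc.example.at), so a scan of the TXT records on the domain itself may undercount them.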
TXT records for SPF/DMARC
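To see which third-party systems show up in SPF values, it is enough to pull the `include:` mechanisms out of each record. A minimal sketch, with `spfIncludes` as a made-up function name; it ignores other mechanisms such as `a`, `mx`, `ip4`, and `redirect`.

```php
<?php
// Hypothetical helper: lists the domains referenced by include: mechanisms.
function spfIncludes(string $txt): array
{
    $txt = trim($txt);
    if (preg_match('/^v=spf1(\s|$)/i', $txt) !== 1) {
        return []; // not an SPF record
    }
    $includes = [];
    foreach (preg_split('/\s+/', $txt) as $term) {
        // A mechanism may carry a qualifier (+, -, ~, ?) in front of "include:".
        if (preg_match('/^[+\-~?]?include:(\S+)$/i', $term, $m) === 1) {
            $includes[] = strtolower($m[1]);
        }
    }
    return $includes;
}

print_r(spfIncludes('v=spf1 include:spf.protection.outlook.com include:_spf.google.com -all'));
```

Counting the returned domains across all records gives a quick ranking of the mail, marketing, and support services in use.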
TXT - "domain blocked": I found the TXT entry "domain blocked" in 10,755 domains. Checking some samples showed that all these domains appear in the nic.at Whois with the status "pendingDelete". Therefore, these could be TXT entries from nic.at for domains that do not comply with the rules.
TXT with advertising: You can always be pretty sure that wherever there is a text field, someone will come up with the idea of filling it with advertising. Creativity or desperation?
The previous analyses focused on specific DNS record types. ANY queries are not supported by all servers because they are complex and slow (see Cloudflare Blog). So what do you get when you send ANY queries to the name servers? Surprisingly, many servers respond, and types such as CAA, HINFO, RRSIG, DNSKEY, and DS are widespread among .at domains.
SPF - "You're Doing it Wrong!": 0.5 % of the domains have entries of type "SPF", which are actually wrong because SPF entries must have the type "TXT" and should be changed accordingly. Although the type existed at some point, it was revoked in 2014 and is no longer supported by many servers (see Wikipedia).
HINFO: A HINFO entry containing RFC8482 or "ANY not supported" signals that the server does not answer ANY queries. Currently, this applies to 6 % of the queried domains.
DNSSEC entries: DNSSEC stands for Domain Name System Security Extensions and is intended to ensure the authenticity and integrity of DNS entries through digital signatures. There are different types of entries that all have a specific role in DNSSEC: DNSKEY, RRSIG, NSEC, NSEC3PARAM, DS. About 3 % of .at domains use DNSSEC.
CAA entries: CAA entries were specified for 1 % of domains, indicating which Certificate Authorities are authorized to issue certificates for this domain. Typically, providers such as geotrust.com, letsencrypt.org, or digicert.com are listed in this record.
HTTP headers are pieces of information sent from a web server to a client (such as a web browser) to provide additional details about the transferred data. This information can include various things such as the type of content, response size, cache instructions, and more.
Out of the 1.3 million domains initially accessed, 888,450 provided a response and a total of 6,288,955 HTTP response headers were stored. An overview of HTTP headers can be found on MDN, Wikipedia, or at OWASP.
The server header is a response header that is optionally sent by web servers to identify the name and version number of the web server or server software being used. The server header can potentially reveal sensitive information that attackers can exploit to find vulnerabilities.
Apache is still almost twice as large as Nginx: Nginx delivers 160,585 websites, while Apache handles 305,249. In total, 555,428 websites returned one of 610 different server headers.
HTTP server header
The HTTP header "X-Powered-By" is an optional HTTP response header field sent by web servers to identify technologies being used. The X-Powered-By header may contain information such as programming language, web server software, database software, and other technologies being used.
22.8 % of homepages send an X-Powered-By header. 475 different headers were stored, with the TOP 10 led by PHP, Plesk, and ASP.
Pages with X-Powered-By
HTTP X-Powered-By Header
HTTP CSP (Content Security Policy) headers are a mechanism for protecting web applications from cross-site scripting (XSS) and other attacks. By defining trusted sources, the injection of malicious data into the web page is intended to be prevented.
Only 0.7 % of websites send a CSP header. This seemed low to me until I remembered that a CSP can also be defined in HTML using a <meta> tag. Therefore, the numbers from the CSP HTML evaluation were integrated here. After that, it was a "whopping" 0.9 % 🙄.
Domains with/without CSP
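Checking both delivery paths, header and <meta> tag, can be sketched as follows. The function name `hasCsp` is an assumption, `$headers` is expected as an array keyed by header name, and the regular expression is a simplification that a DOM parser would handle more robustly.

```php
<?php
// Hypothetical helper: true if a CSP is delivered via header or <meta> tag.
function hasCsp(array $headers, string $html): bool
{
    foreach (array_keys($headers) as $name) {
        // Header names are case-insensitive.
        if (strtolower((string) $name) === 'content-security-policy') {
            return true;
        }
    }
    // Simplified pattern for <meta http-equiv="Content-Security-Policy" ...>.
    return preg_match(
        '/<meta[^>]+http-equiv\s*=\s*["\']?content-security-policy["\'\s>\/]/i',
        $html
    ) === 1;
}

var_dump(hasCsp(['Server' => 'nginx'], '<meta http-equiv="Content-Security-Policy" content="default-src *">'));
```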
138,221 websites provided both a Date and a Last-Modified header. The calculated age of the content shows that there are many websites that update daily, but also many whose content has not changed for over two years.
Domains with/without Age
Age in days
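The age calculation is plain date arithmetic on the two header values. A minimal sketch, assuming both headers are available as strings; the function name `contentAgeInDays` is made up for illustration.

```php
<?php
// Hypothetical helper: age of the content in whole days, or null if unparsable.
function contentAgeInDays(string $date, string $lastModified): ?int
{
    try {
        // HTTP dates look like "Wed, 10 May 2023 12:00:00 GMT".
        $served = new DateTimeImmutable($date);
        $modified = new DateTimeImmutable($lastModified);
    } catch (Exception $e) {
        return null; // malformed header value
    }
    // diff()->days is the absolute number of full days between both timestamps.
    return $modified->diff($served)->days;
}

echo contentAgeInDays('Wed, 10 May 2023 12:00:00 GMT', 'Mon, 10 May 2021 12:00:00 GMT'), PHP_EOL; // 730
```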
Cookies are data stored by a website on a user's computer or mobile device when they visit the site. Cookies contain information about activities on the page or specific settings.
Cookies have had a "small" image problem for a few years because they have been used for things that were not intended when they were invented. That's why website operators must ask users beforehand if they can set cookies, at least for all cookies that are not technically necessary. Of the 578,385 websites crawled, 29.4% set cookies on page load (without asking). Hopefully, only technically necessary ones?!
Pages with cookies
There are pages that use cookies as a kind of "database" and store all kinds of stuff in them. That's why the top performers in the statistics set a double-digit number of cookies as soon as you open the page. 😳
TOP 10 - Cookie count on page
The most commonly used cookies are typical session cookies (e.g., PHPSESSID, beng_proxy_session, ...), cookies for security-related features (e.g., XSRF-TOKEN), and cookies that store settings (e.g., localization, pll_language).
TOP 10 - Cookie names
Unfortunately, there are also cookies that probably do not fall under the "technically necessary" category. Cookie names such as facebookPixel, remarketing_cid, SC_ANALYTICS_GLOBAL_COOKIE, ad_storage, gtm, trackings, ... already sound suspiciously like advertising, analytics, or tracking. Fortunately, they are only found occasionally.
When downloading the homepages, the time required to download them was measured. Only 2% could be downloaded within 100 milliseconds. 22.8% required between 100 and 249 ms. For the largest proportion of pages (39.2%), downloading took between 250 ms and 500 ms. For 16.3%, downloading took longer than one second to complete.
Page speed
More than 500 ms is already relatively slow for a website. One must keep in mind that only the download of the HTML is considered here and the time is measured until the HTML is completely received by the browser. Stylesheets, images, and scripts still need to be loaded and executed there.
HTML stands for "Hypertext Markup Language" and is the markup language used to create web pages. It is used to define the content and structure of a web page by including various elements like headings, text, images, and links, which are interpreted and displayed by a web browser.
HTML consists of a set of tags (markup elements) that tell browsers how to display the content. For example, the <h1> tag can be used to create a first-level heading, while the <p> tag is used to define a paragraph.
In April 2023, I attempted to download 1.3 million homepages. Before downloading, a check was made on port 80 or 443, and only responses without errors were considered (HTTP status code: 200). As a result, I was able to save the HTML of 578,385 homepages and analyze them afterward.
With the GitHub package crwlrsoft/crawler from crwl.io, you can easily and quickly develop web crawlers that download entire web pages or only parts of them.
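use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler'); // placeholder bot name
$crawler->input('https://www.orf.at/')
    ->addStep(Http::crawl()->depth(1));
$crawler->runAndTraverse();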
The proud and surprising winner of the size comparison is a website with just over 30 MB. Please note that this is only HTML, with no images, JS, CSS, or anything else. One might think that there is a lot of meaningful content on it, but unfortunately, that's not the case. The page was obviously created with Microsoft Word and exported to HTML, so it contains a lot of invisible code.
Fortunately, such page monsters are rather rare (only 3,406 or 0.58 % have more than one megabyte). Just over 50 % of the 577,552 HTML responses are between 1 and 50 kilobytes, and another 34 % are between 50 and 256 kilobytes.
Distribution of HTML sizes
How big is the proportion of content to markup, actually? If you remove all HTML tags from the markup, you get the pure content of the page. More than 85 % of websites have a content proportion of up to 60 %. Only just over 15 % of pages have a content proportion of more than 60 %.
What is the content proportion?
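One way to approximate the content proportion is to strip the markup and compare the remaining text length with the size of the raw HTML. This is a sketch under my own assumptions (byte lengths, whitespace collapsed, scripts and styles removed); the function name `contentRatio` is hypothetical and the exact method behind the chart may differ.

```php
<?php
// Hypothetical helper: share of visible text in the raw HTML, between 0.0 and 1.0.
function contentRatio(string $html): float
{
    if ($html === '') {
        return 0.0;
    }
    // strip_tags() keeps the text inside <script>/<style>, so drop those first.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $html) ?? $html;
    $text = strip_tags($text);
    // Collapse whitespace so indentation does not count as content.
    $text = trim(preg_replace('/\s+/', ' ', $text) ?? $text);
    return strlen($text) / strlen($html);
}

printf("%.2f\n", contentRatio('<p>Hello</p>')); // 5 of 12 bytes are content
```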
HTML-Tags per page
The maximum number of HTML tags on a page is 271,726. This page is 21 megabytes in size and is managed with WordPress. The commenting function is set up so that anyone can comment, resulting in 19,537 spam comments on the page. 🤕
Top-10 HTML-Tags
DIV does not stand for "diverse", but it is used that way. It is no surprise that <div> tags are so common in HTML documents. But links (<a> tags) in second place and <script> and <link> tags in the TOP 10 are quite interesting.
SEO stands for "Search Engine Optimization" and refers to the practice of designing and optimizing websites to appear as high as possible in search results.
There are specific recommendations for how a well-optimized SEO page should be designed. First, the HTML should be "well-formed", which means there should be no errors in the markup. You can easily check this with HTML Tidy. Only 6.9 % of the pages are error-free, 84.1 % have warnings (small errors), and 9 % of the pages have major HTML errors.
HTML errors and warnings
HTML with SEO tags
A few basic things should be found on an SEO-optimized page. 21.7 % of the tested pages contain a DOCTYPE, a <title>, a <meta> description, exactly one <h1>, and at least one <a> tag.
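The five basics can be checked with PHP's built-in DOM extension. The following is a rough sketch, not the evaluation code behind the chart: the function name `hasSeoBasics` is made up, it assumes ext-dom is installed, and it treats an empty description as missing.

```php
<?php
// Hypothetical helper: checks the five basics named above on one HTML document.
function hasSeoBasics(string $html): bool
{
    // loadHTML() may add a default DOCTYPE, so test the raw string instead.
    $hasDoctype = stripos(ltrim($html), '<!doctype') === 0;

    $previous = libxml_use_internal_errors(true); // silence warnings on HTML5 tags
    $dom = new DOMDocument();
    $loaded = $html !== '' && $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return false;
    }

    // Look for <meta name="description" content="..."> with non-empty content.
    $hasDescription = false;
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description'
            && trim($meta->getAttribute('content')) !== '') {
            $hasDescription = true;
        }
    }

    return $hasDoctype
        && $dom->getElementsByTagName('title')->length > 0
        && $hasDescription
        && $dom->getElementsByTagName('h1')->length === 1
        && $dom->getElementsByTagName('a')->length >= 1;
}

$page = '<!DOCTYPE html><html><head><title>Demo</title>'
    . '<meta name="description" content="A demo page"></head>'
    . '<body><h1>Demo</h1><a href="/">Home</a></body></html>';
var_dump(hasSeoBasics($page)); // bool(true)
```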
HTML Tidy can be used directly in PHP, as there is a native PHP extension for it.
# sudo apt-get install php8.2-tidy
$tidy = tidy_parse_string($htmlContent);
$tidy->cleanRepair();
$tidy->diagnose();
var_dump($tidy->errorBuffer);
I prefer to work with HTML using the Symfony DomCrawler extension.
# composer require symfony/dom-crawler
# composer require symfony/css-selector
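use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\CssSelector\CssSelectorConverter;
// $html holds the markup of the downloaded page.
$crawler = new Crawler($html);
// Translate the CSS selector "title" into XPath, then count the matches.
$converter = new CssSelectorConverter();
$titles = $crawler->filterXPath($converter->toXPath('title'));
var_dump(count($titles));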
Unlike non-semantic tags like <div> or <span>, which are only used to structure sections of the webpage or group elements, semantic tags provide specific information about what type of content is in the tag.
Commonly used semantic tags include: <header>, <footer>, <main>, <aside>, <nav>, and <section>. 71% of the tested web pages use semantic block-level HTML tags.
HTML with semantic tags
Structured data in HTML are a standardized format for displaying information about the content of a web page. These metadata are used by search engines, social media, and other web services to better understand the content and context of a web page.
On schema.org, a collection of metadata specifications (called "schemas") is published that developers can use to make web pages more machine-readable. The most commonly used elements relate to technical details such as EntryPoint, SearchAction, BreadcrumbList, ... or the general indication that it is a WebSite or WebPage.
TOP 10 schema properties
Links are still an important ranking factor for all major search engines. Who is linking to whom and how often? Within the .at domains, the most linked pages are herold.at, google.at, and wko.at.
TOP 10 linked pages
Most of the list is not surprising, but why is the Federal Chancellery (bka.gv.at) so well-linked? The Legal Information System can be reached under the subdomain ris.bka.gv.at, and many companies link in the footer to the current Trade Act, which can be found in the RIS. Mystery solved!
The highest number of outgoing links to different .at domains is found on the page: museen-in-oesterreich.at. There are 561 .at domains linked there and a total of 1,551 external links.
Outbound links histogram
The <img> tag is used to insert an image into a web page. The tag has a required attribute "src" which specifies the URL of the image to be inserted. I found 8.7 million images on 489k pages, of which 2 million had an ALT attribute.
<img> total: 8,745,391
Pages with <img>: 489,633
<img> with ALT: 2,090,312
Max. on one page: 27,377
One page even had 27,377 images embedded, and because no one would believe it, I have a screenshot here as proof. It's not as bad as it seems at first: the images are lazy-loaded and the page is "okay" fast. Still not optimal.
The "alt" attribute is an important attribute for images. It indicates what is shown in an image if the image cannot be displayed for any reason and also serves to improve the website's accessibility for people with visual impairments.
Do important things appear in "alt" attributes? "Logo" is the word that appears by far the most in "alt" texts. You can find an image with "Logo" alt text on 128,494 pages. I particularly like "alt" texts that contain "Image" (10,685 pages), "Bild" (5,496 pages), "Foto" (4,128 pages), or "Icon" (9,370 pages). Not.
Often used words describe either menu elements or social media links or sharing buttons. Below is a small overview of these two categories.
Alt text (menus and navigation) | Count |
---|---|
menu, menü, menue, menu-icon, mobile-menu, submenu, ... | 62,508 |
home, haus, homepage, ... | 16,864 |
arrow, pfeil, arrow-right, arrow-left, pfeilchen, abwärtspfeil, previous, next, richtungspfeil, pfeil-icon, navigationspfeil, pfeillinks, pfeilrechts, ... | 13,001 |
icon, icons, ... | 9,662 |
burger, burger-menu, bento ... | 866 |

Alt text (social media) | Count |
---|---|
facebook, facebookicon, facebook_pixel, facebook-logo, ... | 18,602 |
instagram, instagramicon, instagramm, ... | 9,934 |
youtube, youtubeicon, social-logo-youtube, ... | 8,178 |
linkedin, linkedin-logo, ... | 3,603 |
tiktok, tiktok-logo, #tiktok, ... | 498 |
If you are a data protection warning lawyer, and you are looking for a source of income, a disciplinary proceeding at the bar association does not deter you and you want to stand trial for serious fraud, then you can "surf" the 95,102 websites with Google Fonts yourself and then issue warnings. 🤡 ... and please don't forget the 130 .gv.at pages.
Startpages with Google tags
By now, everyone knows that Google Fonts are "warnable". However, the problem with external links that send private IP addresses to the USA or other countries is bigger and affects virtually all external resources that are included using <script> or <link> without user consent.
If you can find 95k pages with Google Fonts, how many pages actually include external resources from other domains? I was able to find a total of 379k domains (65 %) that use external scripts or stylesheets and thus potentially transfer personal data to third parties.
TOP 20 external resources domains
Resources (scripts or stylesheets) from all Google domains together are embedded on over 221k websites. These include Google Fonts, Google Analytics, and Google Tag Manager, for example. The content management systems Jimdo, WIX, and WordPress are in the "Top Ten". World4You (place 3) appears again here because so many .at domains are parked there.
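Finding such external resources comes down to collecting the hosts of all <script src> and stylesheet <link href> URLs and dropping the page's own host. The sketch below assumes ext-dom is installed; the function name `externalResourceHosts` is made up, and for simplicity it also treats subdomains of the page's own domain as external.

```php
<?php
// Hypothetical helper: hosts of external scripts and stylesheets on a page.
function externalResourceHosts(string $html, string $pageHost): array
{
    $previous = libxml_use_internal_errors(true); // silence warnings on HTML5 tags
    $dom = new DOMDocument();
    $loaded = $html !== '' && $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return [];
    }

    $urls = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        $urls[] = $script->getAttribute('src');
    }
    foreach ($dom->getElementsByTagName('link') as $link) {
        if (stripos($link->getAttribute('rel'), 'stylesheet') !== false) {
            $urls[] = $link->getAttribute('href');
        }
    }

    $hosts = [];
    foreach ($urls as $url) {
        // Relative URLs have no host and are therefore treated as internal.
        $host = parse_url($url, PHP_URL_HOST);
        if (is_string($host) && strcasecmp($host, $pageHost) !== 0) {
            $hosts[strtolower($host)] = true; // keys deduplicate repeated hosts
        }
    }
    return array_keys($hosts);
}

$html = '<script src="https://www.googletagmanager.com/gtm.js"></script>'
    . '<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto">'
    . '<script src="/local.js"></script>';
print_r(externalResourceHosts($html, 'example.at'));
```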
In this section, all analyses that were conducted based on website content but are not specifically related to HTML or other technologies are included.
The word count, after removing HTML, scripts, and styles, ranges from 0 to 600,000 words (... again, the WordPress page with open comments 😳). On 29% of websites, fewer than 100 words are written. Some of these probably load content using JavaScript, and the others are placeholder pages or pages of people who like to keep it brief.
Word count on websites
In Austria, the disclosure requirement for electronic media under the Media Act (commonly referred to as an imprint obligation) has existed for several years. Since both commercial and private websites require an imprint, one could assume that the word "Impressum" (or an English equivalent) appears somewhere on every page?
Websites with/without legal notice
I was able to find the word "impressum," "Impressum," "Legal Notice," or "Legal Disclosure" on 64.67% of websites. Now we still need to account for pages that load all content through JavaScript and websites in languages other than German or English. Speaking of languages, what languages are found on Austrian websites?
With the PHP package patrickschur/language-detection, languages can be detected. I shortened the contents of the websites to 50 words and then performed language recognition. Not surprisingly, 70% are in German, for 20%, recognition was not possible, and 8.5% are in English. Among the other 1.5% of languages, mainly languages from neighboring countries can be found.
Website language
TOP 10 - Other: Czech (680), Vietnamese* (279), Polish (219), Hungarian (184), Slovak (176), Russian (133), Swedish (90), Slovenian (84), Turkish (80), Serbian (72)
279 pages in Vietnamese? That seemed fishy to me. And it is. All of these pages deliver the same content (see screenshot), which comes from a landing page platform in Vietnam. The domains belong to a domain reseller in Germany. Many of the pages are marked as phishing sites by Google Chrome.
Something is rotten in the state of Denmark. Currently, the pages seem to be harmless, but that can change at any time if new content is served.
Gender-sensitive language is still not really common on Austrian websites. If you search the 345,105 pages with more than 200 words for typical gendering forms, you will only find them on 14 % of the homepages. The search was conducted for the recommended/common forms with "binnen-I" (In, Innen), asterisk (*in, *innen), colon (:in, :innen), slash (/in, /innen), and underscore (_in, _innen).
Gendering on websites
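The five forms can be approximated with one regular expression each. This is a simplified sketch, not the search used for the charts: the function name `genderForms` and the patterns are my assumptions, and the "binnen-I" pattern also fires on words like LinkedIn, so manual cleanup is still needed.

```php
<?php
// Hypothetical helper: names of the gendering forms that occur in a text.
function genderForms(string $text): array
{
    // Simplified patterns for the suffixes -in / -innen in each of the five forms.
    $patterns = [
        'binnen-I'   => '~\p{Ll}In(nen)?\b~u',
        'asterisk'   => '~\p{L}\*in(nen)?\b~u',
        'colon'      => '~\p{L}:in(nen)?\b~u',
        'slash'      => '~\p{L}/in(nen)?\b~u',
        'underscore' => '~\p{L}_in(nen)?\b~u',
    ];
    $found = [];
    foreach ($patterns as $form => $pattern) {
        if (preg_match($pattern, $text) === 1) {
            $found[] = $form;
        }
    }
    return $found;
}

print_r(genderForms('Unsere Mitarbeiter*innen und KundInnen'));
```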
Among the gendered forms, "binnen I" is the most popular at 8 %, followed by colon (4 %) and asterisk (4 %), with slash (2 %) and underscore (0.3 %) trailing behind. The most commonly gendered words are: Mitarbeiter (employee), Kunde (customer/client), Schüler (student), Teilnehmer (participant), and Patient (patient).
Gender-Forms
Gendered words
* For the singular "binnen I" form (In), I had to do some manual post-editing and remove all obviously incorrect words, such as LinkedIn, LogIn, CheckIn, etc.
The list of gendered words contains 11,433 different words. In addition to many common terms, you can also find words that are not used so often: Trickdogtrainer/in, Corona-Verharmloser*innen, Woidarbeiter*innen, Qualitätsröster/innen, DownhillerInnen, Clown*innen, Hackbrett-Künstler/in, Wildtierschmuggler:innen, ViewerInnen, ...
I hope I was able to show through these examples that a structured analysis of websites can reveal a lot about your competitors or potential customers. The focus of the evaluations here was more on technical indicators and superficial analyses. However, the possibilities are much more extensive and can provide valuable insights for your business.
Michael Feichtinger. Developer and consultant. After 10 years of development work at various web agencies and almost 15 years at a large Austrian job portal, I am now self-employed and available for hire. I deal with all topics related to web development, technologies, and processes in development departments.
You can contact me on Twitter, LinkedIn, or by email.
Privacy Policy: No personal data is collected, processed, stored, or shared on this site. No cookies are set, and no tracking or other external dependencies are integrated.
Legal: Klosterstraße 3, 4020 Linz, Webdevelopment and Consulting, Member of WKÖ, Authority: Bezirkshauptmannschaft Linz, GISA: 35488286