Tu Felix 🇦🇹

Internet in Austria

Date: 2023-05-10, Version: 1.1

What's actually inside the Austrian internet?
How much? How big? What exactly? Where exactly? By whom?

In honor of the Austrian Staatsfeiertag, I collected 1.3 million .at domains, searched for their corresponding IP addresses, scanned important network ports, made DNS queries, retrieved home pages, and saved HTML, HTTP headers, and cookies. In total, 56 gigabytes of data were gathered, and analyzed extensively afterward. Here's what I found.

This site contains the results and an overview of the Austrian internet (only .at domains), as well as the sources and methods that can be used to obtain information from public data. If you want to learn more about future customers, potential partners, and persistent competitors, you often just need to know where and how to search for them on the internet.

Wikipedia: The phrase "Tu felix Austria" is said to describe Austrians as having a particularly happy disposition or way of life. It was first used, probably by Duke Rudolf IV, on his seals in 1364...

Which came first, the chicken or the egg? Where is the best place to start? There are several possible starting points. The .at domains seemed the most promising to me. It would also have been possible to use Austrian IP addresses, but this would have led to much poorer results, as will be seen in the IP address section.

What about Austrian websites that use .com, .net, .org, and other generic top-level domains? It's true that only a fraction was examined here. Nevertheless, .at domains are very popular and used by many companies and individuals. Finding out which other domains belong to Austria is a project for another day.

Overview and TL;DR

Too long; didn't read! For those who don't have time to read everything, here's a brief overview of the sections and key findings from each area.

Domains

Content: What domains exist in Austria? How to find domains? What domains are important and what can be learned from domains?

  • Currently, no .at domains with 1 or 2 characters are available (only under .or.at or .co.at)
  • Only 5,000 domains are longer than 30 characters
  • Umlaut domains (IDNs) are hardly used (2.1% are IDNs)
  • According to TOP lists, orf.at, google.at, and willhaben.at are the most important
  • 460,000 domains contain hyphens
  • Many domains contain: locations, industries, and conjunctions
  • .priv.at domains are hardly used (294 .priv.at domains found)

IP Addresses

Content: What are IP addresses? How to find IP addresses? Where are the locations of the addresses and which IP is responsible for which domains?

  • 205,572 domains (15.6 %) are not assigned to any IP (no A-record)
  • The 1.3 million .at domains point to 112,162 different IP addresses
  • One IP address is responsible for 27,802 .at domains
  • IP addresses from 95 different countries. 52,931 IPs in Germany

Network Ports

Content: What are network ports? How to scan them? Which IP addresses have which ports open and what insights can be gained from this?

  • 95,717 of the IPs (85 %) have a Shodan-DB entry
  • 42 % of the IP addresses have fewer than 5 ports open
  • 1,298 different port numbers were found
  • 88 % of .at websites are accessible on both HTTPS and HTTP
  • Of 22,000 open DB ports, 18,091 are MySQL

DNS Entries

Content: What DNS entries are there and what can be learned from them? Which domains can even work? Which technologies (mail providers, service desks, marketing tags, ...) and cloud providers are often used? Which mail providers are the largest?

  • 6,288,955 DNS entries in total
  • 853,055 domains have NS, A, and SOA entries
  • IPv6 AAAA entries were found for 131,592 domains
  • 741,491 domains have at least one MX entry and can receive emails
  • Over 450k domains use only one MX entry
  • Products from over 70 manufacturers can be easily identified in the TXT entry
  • SPF was found for 380,332 domains. DMARC is only used by 752 domains

HTTP Headers & Cookies

Content: What are HTTP headers? What can be read from headers? What are commonly used headers? How often are web pages updated? Who else uses cookies?

  • Top 3 webservers according to headers are: Apache, Nginx, and OpenResty
  • Only 22.8 % of websites send an X-Powered-By header
  • 72,292 websites are developed with PHP according to the X-Powered-By header
  • Not even 1 % of websites use Content Security Policy (CSP)
  • 138,221 websites have Date & Last-Modified headers. 34 % of them have not been updated for 2 years
  • 29.4 % of websites set cookies when the homepage is called

HTML Content

Content: What is HTML? What can be found in HTML? How big is the HTML? Which tags are often used? Which content is included externally? How SEO-optimized are the pages? Who has included microformats? Evaluations of inbound/outbound links. Who uses images and which ones?

  • 578,385 homepages delivered HTML content
  • Largest HTML found is just over 30 MB
  • 50% of HTMLs were between 1 and 50 KB
  • <div> is the most used tag with 78 million occurrences
  • Only 6.9% of web pages are error-free. 84.1% have warnings and 9% produce errors
  • Only 21.7% have Doctype, Title, Description, H1, and at least one link in the HTML
  • The TOP 3 domains with the most linking domains are herold.at, google.at, and wko.at
  • 489,633 pages have <img> tags. The maximum number of images on one page is 27,377

Website Content

Content: Who has an imprint? What languages are used on websites?

  • The content of 29% of web pages consists of less than 100 words
  • Legal statements could be found on 64.53% of web pages
  • 70% of web pages are in German and 8.5% are in English

Domains (.at)

It all starts with a domain. An address that can be entered into a browser and, in the best case, leads us to a fast, beautiful, well-made website with great content. A complete domain is called a Fully Qualified Domain Name (FQDN).

Web addresses, also called URLs (Uniform Resource Locators), start with the protocol (e.g. http://), followed by zero, one, or more subdomains (e.g. www), and then comes the domain name and the Top-Level Domain (TLD). All parts are separated by dots. The domain name is usually a Second-Level Domain (SLD).

The country-specific Top-Level Domain (ccTLD) for Austria is "at". Not every website hosted in Austria or targeting people in Austria needs to have a .at domain. However, many do. 1.5 million, according to official nic.at statistics from April 2023.

Simple URL Examples
Protocol | Subdomain | Domain (Second-Level Domain) | Top-Level Domain
https:// | (none) | orf | at
https:// | tvthek | orf | at

nic.at is the company responsible for the allocation and management of .at domains. In addition to the .at TLD, .co.at and .or.at domains can be registered. The "co" stands for "Commercial" and "or" for "Organization". These SLDs are the equivalent of the international models .com and .org. However, the two SLDs are not nearly as popular as .at domains. According to nic.at statistics, 33,000 .co.at domains have been registered so far, and only 8,000 .or.at domains.

Second-Level-Domains and Public Suffixes

There are several other Second-Level Domains that are not directly managed by nic.at. The .gv.at domain is intended for government agencies and is managed by the Federal Ministry of Finance (see GV.AT domain management). For academic institutions, there is the .ac.at domain, which is assigned or managed by the University of Vienna (see ACOnet).

URL Examples with TLD and Public Suffix
Protocol | Subdomain | Domain name | Public Suffix + TLD
https:// | (none) | help | gv.at
https:// | (none) | augustin | or.at
https:// | www | univie | ac.at

In addition, there are SLDs operated by private companies or associations. These domains are often treated as TLDs in browsers (e.g. for cookies and the address bar), which is why the non-profit organization Mozilla maintains a list where operators can register.

Currently, there are 19 SLDs for .at registered in the Mozilla Public Suffix List:

priv.at, sth.ac.at, wien.funkfeuer.at, *.futurecms.at, *.ex.futurecms.at, *.in.futurecms.at, futurehosting.at, futuremailing.at, blogspot.co.at, biz.at, info.at, 123webseite.at, myspreadshop.at, 12hp.at, 2ix.at, 4lima.at, lima-city.at, *.ex.ortsinfo.at, *.kunden.ortsinfo.at

The following analyses refer to the official five TLD/SLDs (at, or.at, co.at, gv.at, ac.at). Analyses for other SLDs can be found in specific sections (e.g. priv.at).

Details: Parse domains with crwlrsoft/url

With the GitHub package crwlrsoft/url from crwl.io, you can easily parse domains or URLs, taking into account the public suffixes and supporting IDNs.

require 'vendor/autoload.php'; // Composer autoloader

use Crwlr\Url\Url;

// Parse the URL and print the public suffix of its host
$url = Url::parse('https://www.domain.gv.at');
var_dump($url->domainSuffix());

Sources and Research

There is no publicly available comprehensive list of all .at domains like there is for other TLDs (Sweden, Switzerland, Centralized Zone Data Service). Therefore, one must create a comprehensive list from various sources. There are both commercial providers and open-source projects for domain lists. Additionally, I used various methods to find additional domains in search results and public log files.

For .ac.at and .gv.at, I requested a comprehensive list from the responsible authorities. These requests were rejected on the grounds of data protection. Whose data is being protected there, anyway? Never mind.

A good source for new domains are the "Certificate Transparency Logs". CT Logs can be publicly viewed and are actually intended to bring more transparency to the issuance of security certificates. Since almost all websites are now delivered over HTTPS, each of these websites needs a certificate, which can then be found in these logs.

Source | at | co.at | or.at | gv.at | ac.at
nic.at Statistics [1] | 1.458.656 | 33.773 | 7.841 | unknown | unknown
domainsproject.org | 954.141 | 13.761 | 2.928 | 1.239 | 1.261
ViewDNS.info | 1.187.203 | 28.454 | 6.492 | 1.689 | 1.397
domains-monitor.com | 462.553 | 8.094 | 1.717 | 1.023 | 1.054
staedtebund.gv.at [2] | - | - | - | 2.106 | -
bing Web Search API [3] | - | - | - | 134 | 109
Short Domain Checker [4] | 19.697 | 402 | 176 | - | -
Cert Transparency (CT) Log Monitoring [5] | 18.659 | 294 | 69 | 55 | 70
DMOZ Export (2016) [6] | 17.884 | 426 | 141 | 228 | 174

Sources:

By combining all sources, one can obtain numbers that are quite close to the public statistics provided by nic.at. Achieving 100% accuracy is hardly possible, as new domains are constantly being registered, and many remain unknown if nothing is done with them.

Anyway, I work with what I have.
1.317.549 .at-Domains.

at 1.277.059
co.at 29.458
or.at 6.770
gv.at 2.794
ac.at 1.468
Total 1.317.549
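
Combining the sources boils down to normalizing, de-duplicating, and counting per suffix. The following is only a minimal sketch of that idea, not the script I actually used: the function names and the two tiny input lists are hypothetical, and str_ends_with() assumes PHP 8.

```php
<?php
// The five suffixes analysed in this article, longest first so "co.at" wins over "at".
const SUFFIXES = ['co.at', 'or.at', 'gv.at', 'ac.at', 'at'];

/** Merges several domain lists, normalizes case and removes duplicates. */
function mergeDomainLists(array ...$lists): array
{
    $seen = [];
    foreach ($lists as $list) {
        foreach ($list as $domain) {
            $domain = rtrim(strtolower(trim($domain)), '.');
            if ($domain !== '') {
                $seen[$domain] = true; // array keys are unique by definition
            }
        }
    }
    return array_keys($seen);
}

/** Returns the matching suffix of a domain, or null if it is not an .at domain. */
function suffixOf(string $domain): ?string
{
    foreach (SUFFIXES as $suffix) {
        if (str_ends_with($domain, '.' . $suffix)) {
            return $suffix;
        }
    }
    return null;
}

/** Counts domains per suffix; domains outside .at are ignored. */
function countBySuffix(array $domains): array
{
    $counts = array_fill_keys(SUFFIXES, 0);
    foreach ($domains as $domain) {
        $suffix = suffixOf($domain);
        if ($suffix !== null) {
            $counts[$suffix]++;
        }
    }
    return $counts;
}

$merged = mergeDomainLists(
    ['orf.at', 'Univie.ac.at', 'wien.gv.at'],    // hypothetical source A
    ['orf.at', 'augustin.or.at', 'example.com']  // hypothetical source B
);
print_r(countBySuffix($merged));
```

Checking the longer suffixes first matters here, because every .co.at domain also ends in ".at".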

Long Domains

The longest domains I have found are 63 characters long. This is most likely due to the label length restriction of 63 characters between two dots. Otherwise, people would certainly pack even more keywords into the domain or do something silly with it. I found seven .at domains with 63 characters.

The zzzz... domain also exists for ac.at and would therefore be the longest domain if you add the five characters of the TLD. This domain appears to be a test domain of the University of Vienna and currently does not serve any web content (HTTP/HTTPS).

Side note: I asked ChatGPT (3.5-turbo) about the longest known .at domain, and the AI was pretty sure that it had to be donau-dampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft.at. However, I cannot verify or understand this answer. This domain contradicts the rules of nic.at (see nic.at Registration Guidelines) and the standard (RFC) and should never have existed.

The new version GPT-4 gives the correct answer that the longest domain is 63 characters long. However, it does not know any specific domain of this length. AI still can't solve all our problems 😜
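
The 63-character limit per label is easy to check in code. Here is a small sketch under the assumption that a label may only contain a-z, 0-9, and hyphens, with no hyphen at the beginning or end; the function name is made up, and IDNs would have to be checked in their Punycode form.

```php
<?php
/**
 * Checks a single ASCII label (the part between two dots).
 * Assumed rules: 1 to 63 characters, only a-z, 0-9 and hyphen,
 * no hyphen at the beginning or end.
 */
function isValidLabel(string $label): bool
{
    // first character, then optionally up to 61 middle characters plus a last character
    return preg_match('/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/i', $label) === 1;
}

var_dump(isValidLabel(str_repeat('z', 63))); // bool(true)
var_dump(isValidLabel(str_repeat('z', 64))); // bool(false)
var_dump(isValidLabel('-orf'));              // bool(false)
```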

Short Domains. One, Two, ...

Since 2016, the minimum length of a .at domain has been one character (letter or number). Before 2016, domains had to be at least three characters long. Currently, all .at domains with one or two characters are taken. However, there are still thousands of such domains available for .co.at or .or.at.

Length | TLD | Available | Examples
1 | .at | 0 | -
1 | .co.at | 0 | -
1 | .or.at | 6 | q.or.at, 4.or.at, 7.or.at
2 | .at | 0 | -
2 | .co.at | 928 | 00.co.at, 11.co.at, gg.co.at
2 | .or.at | 1148 | zz.or.at, yy.or.at, kk.or.at

"Drei hob i gsogt!": The shortest currently available .at domains are three characters long. There are still more than 30,000 available. So, about 40 % of all 3-character .at domains are registered. Some examples of available domains are: 003.at, 00a.at, k8n.at, zuc.at, 8-j.at.

3 Chars .at-Domains
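
The "about 40 %" can be roughly reproduced with a bit of arithmetic. The sketch below assumes 36 possible characters (a-z, 0-9) plus a hyphen that is only allowed in the middle; the function name is hypothetical, and 30,000 is just the lower bound mentioned above, not an exact count.

```php
<?php
/**
 * Number of possible ASCII labels of a given length (1 to 3),
 * assuming 36 characters (a-z, 0-9) plus a hyphen that may only
 * appear in the middle.
 */
function countPossibleLabels(int $length): int
{
    if ($length < 1 || $length > 3) {
        throw new InvalidArgumentException('Only lengths 1 to 3 are covered.');
    }
    if ($length === 1) {
        return 36;
    }
    // first and last character: 36 options, middle characters: 37 options
    return 36 * (37 ** ($length - 2)) * 36;
}

$total = countPossibleLabels(3); // 36 * 37 * 36 = 47952
$available = 30000;              // lower bound taken from the text
$registeredShare = ($total - $available) / $total;

printf("%d possible, at most %.1f %% registered\n", $total, $registeredShare * 100);
// prints: 47952 possible, at most 37.4 % registered
```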

Addition: Checking domain availability

A domain is available if nic.at says it is. However, bulk Whois queries are not possible at nic.at, as your own IP will be blocked relatively quickly.

To check a large number of domains, you could use resellers. However, I made DNS queries and found that all available domains returned the following answer ...

% dig j-8.at +noadditional +noquestion +nocomments +nocmd +nostats SOA
at.                     10562   IN      SOA     dns.nic.at. domain-admin.univie.ac.at. 1680930002 10800 3600 604800 10800

* Domains with "pendingDelete" status also return this answer. However, there seem to be very few of these, and you can only find out the status by making a Whois query.

A registered domain has NS entries and a different SOA entry.

% dig nic.at +noadditional +noquestion +nocomments +nocmd +nostats SOA
nic.at.                 905     IN      SOA     ns1.nic.at. domain-admin.univie.ac.at. 2023043280 3600 1800 1209600 900
nic.at.                 900     IN      NS      ns1.nic.at.
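
The difference between the two answers can be evaluated automatically. The following sketch only classifies a line of dig output that has already been collected; the function name and return values are my own choice, and the heuristic is only meant for registrable domains (not subdomains).

```php
<?php
/**
 * Classifies one line of `dig <domain> SOA` output.
 * Heuristic from the text: if the SOA belongs to the parent zone
 * (e.g. "at.") rather than to the domain itself, the domain is
 * probably available (or in "pendingDelete" status).
 */
function classifySoaLine(string $domain, string $digLine): string
{
    $fields = preg_split('/\s+/', trim($digLine));
    // expected layout: owner, TTL, class, type, ...
    if ($fields === false || count($fields) < 4 || strtoupper($fields[3]) !== 'SOA') {
        return 'unknown';
    }
    $owner = rtrim(strtolower($fields[0]), '.');
    return $owner === rtrim(strtolower($domain), '.') ? 'registered' : 'probably available';
}

// The two answers shown above
$free  = 'at.  10562 IN SOA dns.nic.at. domain-admin.univie.ac.at. 1680930002 10800 3600 604800 10800';
$taken = 'nic.at.  905 IN SOA ns1.nic.at. domain-admin.univie.ac.at. 2023043280 3600 1800 1209600 900';

echo classifySoaLine('j-8.at', $free), "\n";  // probably available
echo classifySoaLine('nic.at', $taken), "\n"; // registered
```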

Length Distribution: Long Tail.

The length distribution is a classic positively skewed bell curve. More than 50 % of the found domain names are between 7 and 14 characters long. Only just under 5,000 domains are longer than 30 characters.

.at-Domains - Length distribution

Of course, we all want to know what such long domains can look like. Here are some examples that are currently registered but do not have any content.

  • zwinger-of-white-beautys-vom-klostertal.at
  • hier-könnte-ihre-werbung-stehen.at
  • therapiebegleit-besuchshunde-steiermark.at
  • deutsche-doggen-of-castle-jaidhof.at
  • unterbodenschutzundhohlraumversiegelung.at
  • professionelle-homepage-erstellen-lassen.at
  • entruempelungundraeumungwienundumgebung.at

🤷

International Domains (IDN)

Internationalized Domain Names (IDNs) are domains that contain characters other than a-z, 0-9, or hyphen, that is, non-ASCII characters. Typical examples are the German umlauts (ä, ö, ü), which is why IDNs are often referred to as umlaut domains. Such IDNs have only been available since 2004, and according to nic.at statistics as of April 2023, just under 36,000 have been registered.

IDNs are typically stored and processed as Punycode. Punycode is an encoding system in which special characters are encoded using ASCII characters and appended to the end of the domain. Users don't typically see Punycode in their daily use of the internet because the conversion happens automatically in the background by browsers and servers.

Punycode labels start with "xn--", usually followed by the plain ASCII characters of the name, a hyphen, and a short sequence that encodes the special characters and their positions. For example, "österreich.at" becomes "xn--sterreich-z7a.at".
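
To make this tangible, here is a small sketch that detects Punycode domains by their "xn--" prefix and converts an IDN to its ASCII form. The helper names are hypothetical, and the conversion assumes that PHP's intl extension (idn_to_ascii) is installed; without it, the helper simply returns null.

```php
<?php
/** True if any label of the domain carries the "xn--" prefix. */
function isPunycodeDomain(string $domain): bool
{
    foreach (explode('.', strtolower($domain)) as $label) {
        if (strncmp($label, 'xn--', 4) === 0) {
            return true;
        }
    }
    return false;
}

/** Converts an IDN to its ASCII form, or returns null if ext-intl is missing. */
function toAscii(string $domain): ?string
{
    if (!function_exists('idn_to_ascii')) {
        return null; // assumption: intl extension not available
    }
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ascii === false ? null : $ascii;
}

var_dump(isPunycodeDomain('xn--sterreich-z7a.at')); // bool(true)
var_dump(isPunycodeDomain('orf.at'));               // bool(false)
var_dump(toAscii('österreich.at'));
```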

There are still some short umlaut domains available: ö.or.at, ü.or.at, ä.or.at, ü.co.at, ä.co.at. "ß" (sharp s) is not supported in .at domains.

The dataset used contains 2.1 % IDNs.

% IDN Domains

IDNs can be used for malicious purposes. By using certain special characters, users can be deceived, so I was curious which special characters other than German umlauts are used in .at domains.

I found 159 .at IDNs that contain other characters than umlauts, such as ç, ë, ó or é. At first glance, there is nothing exciting about them. Perhaps interesting are ímmowelt.at and ìmmowelt.at, which redirect to immowelt.at. Here, the company has obviously taken proactive measures to prevent others from registering such misleading domains.

  • Variants of rené, crêpes, café or caffè
  • Short domains: à, å, æ, ...
  • Brand names: hermès.at, pokémon.at, loréal.at, créditsuisse.at, nestlé.at, ...

VIPs of .at-Domains

There are several TOP lists from different providers. The most well-known is probably the Alexa Ranking list (which has nothing to do with Amazon's AI assistant). However, Amazon has bought this company and the Alexa website is no longer accessible. Currently, the domain list is still available. This will surely change soon.

The ranking methods differ somewhat. In general, some form of PageRank algorithm is applied. This algorithm evaluates the popularity of a website by counting referring pages (inbound links): the more other pages link to a page, the higher it is in the list.

Other TOP lists use traffic as a measure, which is measured by tracking pixels (ÖWA) or browser extensions (Netcraft).

The results are, let's say, "okay". As you can see in the table below, there are some incorrect and some questionable entries. For example, Majestic ranks gv.at and or.at in the TOP 10, although both are effectively TLDs (public suffixes) rather than individual websites. Then there are some pages on the list that I have never heard of. The ÖWA only includes sites that voluntarily want to be measured (and pay for it).

"Tranco" is a project that was launched by several scientists and aggregates multiple sources and mathematically weights them. They call it "A Research-Oriented Top Sites Ranking Hardened Against Manipulation"... Okay, you decide!


Place* | Majestic [1] | Alexa | Cisco [2] | Netcraft [3] | Tranco [4] | ÖWA** [5] | Similarweb [6] | Overall*
1 (10) | gv.at | orf.at | ad4m.at | orf.at | univie.ac.at | orf.at | orf.at | orf.at (59)
2 (9) | google.at | derstandard.at | google.at | bergfex.at | orf.at | willhaben.at | google.at | google.at (41)
3 (8) | kriesi.at | krone.at | optadata.at | willhaben.at | google.at | krone.at | krone.at | willhaben.at (36)
4 (7) | univie.ac.at | willhaben.at | gmx.at | harryahamer.at | derstandard.at | heute.at | willhaben.at | derstandard.at (33)
5 (6) | orf.at | google.at | waust.at | geizhals.at | kriesi.at | derstandard.at | heute.at | krone.at (28)
6 (5) | shorturl.at | hurawatch.at | willhaben.at | bawag.at | shorturl.at | meinbezirk.at | derstandard.at | univie.ac.at (19)
7 (4) | or.at | shorturl.at | orf.at | karriere.at | krone.at | kurier.at | oe24.at | heute.at (16)
8 (3) | derstandard.at | heute.at | derstandard.at | raiffeisen.at | world4you.at | gmx.at | gmx.at | kriesi.at, shorturl.at (14)
9 (2) | tuwien.ac.at | toon.at | interspar.at | univie.ac.at | tuwien.ac.at | oe24.at | kleinezeitung.at | -
10 (1) | wko.at | oe24.at | post.at | c3w.at | wien.gv.at | kleinezeitung.at | bergfex.at | gmx.at (13)

* In the last column, I have counted the results of the table (unscientifically) and the more often and higher up the pages appear in the table, the higher the score (in brackets) and placement in the column.

** The ÖWA data is from February 2023 and sorted by unique users per individual offering.

Sources:

  • [1] Majestic 1 Mio - "The million domains with the most referring subnets"
  • [2] Cisco Umbrella List - "Most queried domains based on passive DNS usage"
  • [3] Netcraft (AT) - "Most Visited Websites in Austria"
  • [4] Tranco - "A Research-Oriented Top Sites Ranking Hardened Against Manipulation"
  • [5] ÖWA - "Österreichische Webanalyse"
  • [6] similarweb - "Top Websites in Austria"
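
The (unscientific) score in the "Overall" column can be recomputed from the table: place 1 is worth 10 points, place 10 is worth 1 point, and the points are summed across all seven lists. The sketch below does exactly that; the function name is made up, the data is copied from the table.

```php
<?php
// Top-10 lists from the table, in rank order (index 0 = place 1).
$lists = [
    'Majestic'   => ['gv.at', 'google.at', 'kriesi.at', 'univie.ac.at', 'orf.at', 'shorturl.at', 'or.at', 'derstandard.at', 'tuwien.ac.at', 'wko.at'],
    'Alexa'      => ['orf.at', 'derstandard.at', 'krone.at', 'willhaben.at', 'google.at', 'hurawatch.at', 'shorturl.at', 'heute.at', 'toon.at', 'oe24.at'],
    'Cisco'      => ['ad4m.at', 'google.at', 'optadata.at', 'gmx.at', 'waust.at', 'willhaben.at', 'orf.at', 'derstandard.at', 'interspar.at', 'post.at'],
    'Netcraft'   => ['orf.at', 'bergfex.at', 'willhaben.at', 'harryahamer.at', 'geizhals.at', 'bawag.at', 'karriere.at', 'raiffeisen.at', 'univie.ac.at', 'c3w.at'],
    'Tranco'     => ['univie.ac.at', 'orf.at', 'google.at', 'derstandard.at', 'kriesi.at', 'shorturl.at', 'krone.at', 'world4you.at', 'tuwien.ac.at', 'wien.gv.at'],
    'ÖWA'        => ['orf.at', 'willhaben.at', 'krone.at', 'heute.at', 'derstandard.at', 'meinbezirk.at', 'kurier.at', 'gmx.at', 'oe24.at', 'kleinezeitung.at'],
    'Similarweb' => ['orf.at', 'google.at', 'krone.at', 'willhaben.at', 'heute.at', 'derstandard.at', 'oe24.at', 'gmx.at', 'kleinezeitung.at', 'bergfex.at'],
];

/** Sums 10 points for place 1 down to 1 point for place 10 across all lists. */
function overallScores(array $lists): array
{
    $scores = [];
    foreach ($lists as $ranking) {
        foreach ($ranking as $index => $domain) {
            $scores[$domain] = ($scores[$domain] ?? 0) + (10 - $index);
        }
    }
    arsort($scores); // highest score first
    return $scores;
}

$scores = overallScores($lists);
print_r(array_slice($scores, 0, 5, true));
// orf.at 59, google.at 41, willhaben.at 36, derstandard.at 33, krone.at 28
```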

Domain Parts

Domains may contain hyphens, but not at the beginning or end. This makes domains longer but more readable. When domains are divided by hyphens, it provides a good overview of the individual words or parts of speech used in the domain.

  • Linking words or prefixes: my, e, mein, in, und, die, der, ...
  • Industries: elektro, bau, service, it, immobilien, psychotherapie, ...
  • Locations: linz, graz, tirol, salzburg, vienna, austria, a, ...

There are about 460,000 domains that contain hyphens, and dividing them creates a list of 157,000 different words. I used the top 10,000 keywords with more than 3 characters to search for them in other domains. This helps to find keywords that appear in domains without a dividing hyphen.
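
The two steps (split at hyphens, then search for the keywords in domains without hyphens) could look roughly like this. This is a simplified sketch with hypothetical function names and a tiny sample list; it assumes PHP 8 (str_contains) and works on the ASCII form of the domains.

```php
<?php
/** Counts words (longer than 3 characters) in hyphenated domain names. */
function keywordCounts(array $domains, int $minLength = 4): array
{
    $counts = [];
    foreach ($domains as $domain) {
        $name = explode('.', strtolower($domain))[0]; // label before the first dot
        if (!str_contains($name, '-')) {
            continue; // only hyphenated names are split
        }
        foreach (explode('-', $name) as $word) {
            if (strlen($word) >= $minLength) {
                $counts[$word] = ($counts[$word] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent keywords first
    return $counts;
}

/** Finds non-hyphenated domains whose name contains a known keyword. */
function domainsContaining(string $keyword, array $domains): array
{
    return array_values(array_filter($domains, function (string $domain) use ($keyword): bool {
        $name = explode('.', strtolower($domain))[0];
        return !str_contains($name, '-') && str_contains($name, $keyword);
    }));
}

$domains = [
    'professionelle-homepage-erstellen-lassen.at', // from the long-domain examples
    'therapiebegleit-besuchshunde-steiermark.at',  // from the long-domain examples
    'homepage-linz.at',                            // hypothetical
    'meinehomepage.at',                            // hypothetical
];

$counts = keywordCounts($domains);
echo $counts['homepage'], "\n";                   // 2
print_r(domainsContaining('homepage', $domains)); // meinehomepage.at
```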

Keywords in domains (>3 chars)

Other parts of domains: Depending on what you're looking for, you can already deduce a lot from domain names. Here are some examples of evaluations from the domain data.

Domains with years

Top locations in domains

Health sector domains

Domains with prefix

Domains with first names

Domains from the hospitality sector

priv.at Domains. Privatière of domains.

The .priv.at domains are special domains for private individuals, assigned by the VIBE!AT association (Association for Internet Users Austria).

To get a .priv.at domain, one must be a private individual residing in Austria. The domain name can apparently be chosen quite freely, as long as it does not infringe on registered rights. Registration of a .priv.at domain is free.

The dataset contains 294 .priv.at domains. Most of them are first names, last names, or nicknames. The nic.priv.at whois form still shows the full personal data of many domain owners 🤨.

IP addresses

An IP address is a series of numbers that uniquely identifies a device on a network. This allows devices to communicate with each other. IPv4 addresses are still the most common, and anyone who has ever set up an internet router at home probably knows what IP addresses look like. Something like: 192.168.0.1

Websites also have an IP address, or rather, the servers on which these websites are hosted. If you trace the .at domains back to the IP addresses, you will see that 205,572 domains (15.6 percent) are currently not assigned to any IP address.

% Domains with/without IP

Domainname 👉 IP Address

Not every website needs exactly one unique IP address. It is possible for many domains to point to one IP address, and the server to deliver all these websites. The address could also be a load balancer, behind which a whole network is hidden. This network can in turn be responsible for many domains.

Details: IP-Address to Domain

Find A records with dig

% dig +noall +answer orf.at A
orf.at.                 21481   IN      A       194.232.104.140
orf.at.                 21481   IN      A       194.232.104.139
...
orf.at.                 21481   IN      A       194.232.104.149

Find A records with nslookup

% nslookup orf.at
Server:         192.168.50.1
Address:        192.168.50.1#53

Non-authoritative answer:
Name:   orf.at
Address: 194.232.104.3
...
Name:   orf.at
Address: 194.232.104.141

Find IP address with ping

% ping orf.at
PING orf.at (194.232.104.3): 56 data bytes
64 bytes from 194.232.104.3: icmp_seq=0 ttl=54 time=25.222 ms

The 1.3 million .at domains lead to 112,162 different IP addresses, which works out to roughly 10 domains per IP on average. This average is skewed by a few IPs that are responsible for many thousands of domains. These IPs belong to large hosting companies (e.g., World4You, Host Europe, ...) or domain parking services (e.g., Sedo).
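
The average of about 10 follows if you only count the domains that actually resolve to an IP. The sketch below shows that calculation and how domains can be counted per IP; the function name and the small sample mapping (with documentation addresses from 192.0.2.0/24) are hypothetical.

```php
<?php
/** Counts how many domains point to each IP; null means "no A-record". */
function domainsPerIp(array $domainToIp): array
{
    $perIp = [];
    foreach ($domainToIp as $domain => $ip) {
        if ($ip === null) {
            continue; // domain without IP address
        }
        $perIp[$ip] = ($perIp[$ip] ?? 0) + 1;
    }
    arsort($perIp); // busiest IP first
    return $perIp;
}

// Hypothetical sample data
$sample = [
    'a.at' => '192.0.2.1',
    'b.at' => '192.0.2.1',
    'c.at' => '192.0.2.1',
    'd.at' => '192.0.2.2',
    'e.at' => null,
];
$perIp = domainsPerIp($sample);
printf("%.1f domains per IP in the sample\n", array_sum($perIp) / count($perIp)); // 2.0

// The same calculation with the figures from this article
$resolved = 1317549 - 205572; // domains with an A-record
printf("%.1f domains per IP overall\n", $resolved / 112162); // 9.9
```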

Domain count per IP address

The leader here is the IP address 81.19.154.98 from World4You, which is responsible for 27,802 .at domain names. If you call up the IP or the respective websites, you can quickly see why this is the case. Either you will be redirected to another domain immediately (Domain Redirect Service) or you will end up on a domain parking page.

Domain parking screenshot

Who is Who? Autonomous System Number (ASN)

How do you actually know who is responsible for an IP? To find this out, there are several ways: Some information can be seen in the DNS PTR record. This is essentially the counterpart to the A-record and establishes the reverse link from IP to domain name. The request is made with an IP and you get a hostname in return.

Details: PTR lookup with dig
% dig +noadditional +noquestion +nocomments +nocmd +nostats -x 142.251.37.3
3.37.251.142.in-addr.arpa. 26701 IN     PTR     muc11s23-in-f3.1e100.net.
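
As the dig output shows, a PTR query is really a query for a special name: the IP address in reverse order with the suffix "in-addr.arpa". A minimal sketch of how this name is built (the function name is my own; with network access, PHP's built-in gethostbyaddr() performs the actual lookup):

```php
<?php
/** Builds the reverse-lookup name that is queried for a PTR record. */
function reverseLookupName(string $ipv4): string
{
    if (filter_var($ipv4, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        throw new InvalidArgumentException("Not an IPv4 address: $ipv4");
    }
    // reverse the four numbers and append the reverse-lookup zone
    return implode('.', array_reverse(explode('.', $ipv4))) . '.in-addr.arpa';
}

echo reverseLookupName('142.251.37.3'), "\n"; // 3.37.251.142.in-addr.arpa

// The real lookup needs network access:
// echo gethostbyaddr('142.251.37.3');
```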

The easy and fast way is to use an IP/Geo-Location database, like the MaxMind GeoLite2 database. Such databases contain many IP addresses and the associated organizations that can be identified via ASN (Autonomous System Number).

Details: IP Lookup with PHP and GeoLite2
# composer require geoip2/geoip2

require 'vendor/autoload.php';

use GeoIp2\Database\Reader;

// Confused? The "City" database also contains the country data
$reader = new Reader('/usr/local/share/GeoIP/GeoIP2-City.mmdb');
$record = $reader->city('128.101.101.101');

var_dump($record->country->name);

The top 10 responsible organizations by ASN (Autonomous System Number) are mostly located in Germany and are large hosting companies like Hetzner, Host Europe, IONOS, or even Amazon. A special case is Cloudflare, which operates a large international network in the field of security and performance and already includes 6,832 IP addresses (related to .at domains).

IPs by ASN (Organisation)

Analyses of IP Blocks

When looking at the number of IPs and domains on /24 IP blocks, you can see that some hosters distribute their domains across multiple blocks and IPs, while others host a large number of domains on a single block with very few IPs.

An IPv4 /24 block comprises the 256 addresses that share the first three numbers of the IP address; hosts typically use the range from 1 to 254 at the last position. So, for example, from 81.19.159.1 to 81.19.159.254. In the next two charts, you can see the top 10 blocks with the most domains and how many IP addresses were used within the block.
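
Assigning an IP address to its /24 block is a simple bit operation: the last 8 bits are set to zero. The following sketch assumes a 64-bit PHP build; the function names are hypothetical, the addresses are taken from this section.

```php
<?php
/** Returns the /24 block of an IPv4 address, e.g. "81.19.159.0/24". */
function blockOf(string $ipv4): string
{
    $long = ip2long($ipv4);
    if ($long === false) {
        throw new InvalidArgumentException("Not an IPv4 address: $ipv4");
    }
    // keep the first 24 bits, zero the last 8 bits
    return long2ip($long & 0xFFFFFF00) . '/24';
}

/** Groups IP addresses by their /24 block. */
function groupByBlock(array $ips): array
{
    $blocks = [];
    foreach ($ips as $ip) {
        $blocks[blockOf($ip)][] = $ip;
    }
    return $blocks;
}

$blocks = groupByBlock(['81.19.159.1', '81.19.159.254', '81.19.154.98']);
foreach ($blocks as $block => $ips) {
    printf("%s: %d IPs\n", $block, count($ips));
}
// 81.19.159.0/24: 2 IPs
// 81.19.154.0/24: 1 IPs
```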

IP count with domains in /24 IP blocks

The top 3 blocks host a similar number of domains (41 to 46 thousand). IONOS distributes these domains across the entire block (254 IP addresses) and World4You uses only just over half of the IP addresses in one block but uses multiple blocks.

Domain count in /24 IP blocks

What else could be done with IP blocks? In IP blocks where only some addresses are used, the question is, of course, what can be found on the other addresses? Either they are not used, or they contain systems and domains that have not yet been discovered.

Below is an anonymized example of an IP block scan for IP addresses not yet present in the dataset. A network ping is sent, and ports 80 and 443 are scanned. There would still be quite a bit to find, but I've left it at that for now.

IP block scan

IP Geo Locations. Dude, Where's My Domain?

The evaluation of the 112k IP addresses shows that the IP addresses are located in 95 different countries. The majority, namely 52,931 IPs, are in Germany. Even if you add up the IPs in Austria, the USA, France, and the Netherlands, Germany still hosts more IP addresses with .at domains.

21 countries are each responsible for only one or two IPs (e.g., Iraq, Georgia, Colombia, ...). In Russia, there are 191 IPs of .at domains and 47 are in China.

Top 10 IP count per country

There are some free providers of IP/GeoLocation databases or APIs that can help you find the location, country, and responsible organization of an IP address.

The location of an IP address can change from time to time. For example, when IP blocks are resold to other service providers or when a service provider uses a block in a different data center.

Network Ports

Ports are used to separate the communication of applications and services on a device from each other and thus ensure orderly and efficient communication over the network. Each port is identified by a unique number called a port number, which ranges from 0 to 65535.

Typically, specific port numbers are reserved for specific applications and services. From this, conclusions can be drawn about the software used or at least about the available services of the server.

Sources and Research

To scan or to have it scanned? Port scanning is not complicated and can be done with tools like Nmap or a few lines of source code. However, port scanning is time-consuming if you want to scan many ports per domain. Fortunately, someone has already done this, and the results are available via a free API.

Addition: Port scan with PHP

For simple port scanning, you don't necessarily have to work with Nmap. Whether a port is open or closed can be quickly and easily checked with PHP.


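$ipOrHost = 'www.parlament.gv.at';
$port = 25565;
$timeout = 0.2; // seconds

// Try to open a TCP connection; "@" suppresses the warning on failure
$connection = @fsockopen($ipOrHost, $port, $errorCode, $errorMessage, $timeout);

if (is_resource($connection)) {
    echo 'OPEN';
    fclose($connection); // release the socket again
} else {
    echo 'CLOSED';
}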

The Shodan InternetDB is designed to quickly find security vulnerabilities and, for example, monitor your own IP. If, like me, you already have an extensive list of domains and their associated IP addresses, Shodan can be used to find out which ports are open on the respective IPs.

Almost all evaluations in this section are based on Shodan data; only the ports for HTTP vs. HTTPS were scanned by me. Lists of default ports for different services and software can be found all over the internet. See, for example, Wikipedia TCP/UDP Port Numbers or Secbot (Common Ports).
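
Evaluating such an API answer is straightforward. The sketch below assumes that the InternetDB answer for an IP is a JSON object with a "ports" array, and that IPs without an entry return an object without this field; the sample JSON and the function name are hypothetical, and the actual HTTP request is left out.

```php
<?php
/**
 * Extracts the open ports from an InternetDB-style JSON answer.
 * Assumption: the answer contains a "ports" array of numbers;
 * anything else is treated as "no entry".
 */
function openPorts(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['ports']) || !is_array($data['ports'])) {
        return [];
    }
    $ports = array_map('intval', $data['ports']);
    sort($ports);
    return $ports;
}

// Hypothetical sample answers
$withEntry    = '{"ip":"192.0.2.10","ports":[443,80,22],"hostnames":[]}';
$withoutEntry = '{"detail":"No information available"}';

print_r(openPorts($withEntry));    // 22, 80, 443
print_r(openPorts($withoutEntry)); // empty array
```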

Screenshot Shodan Database

IP addresses with Shodan entry

Using the API, 112,162 IP addresses were checked. Of the checked addresses, 95,717 IPs (85%) have an entry in the Shodan database.

Percentage of IPs with/without Shodan entry

Barn Door. Open Ports Per Address

In general, one should make sure that not too many ports are open on a server, as each open port is a potential attack surface from the outside. However, if servers offer many services or act as a gateway for other servers, it can happen that 20 or more ports are open.

42 % of the IP addresses have fewer than 5 open ports, and it is usually a typical combination for web servers consisting of 80, 443, and either SSH (22), FTP (21), SMTP (25, 587), or MySQL (3306).

Count of open ports per IP

Detect Typical Services By Port

"What you see is what you get." - The evaluation resulted in a list of 1,298 different port numbers. Many of these ports are used only occasionally. The most commonly used ports are from the areas of email (IMAP, POP3, SMTP), file transfer (FTP), shell access (SSH), and logically the standard ports intended for web content (80, 443).

Number of IPs with typical ports
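
Mapping open ports to typical services is just a lookup in a table of default ports. The following sketch uses a small, hand-picked selection of well-known defaults (not the complete list behind the chart); the constant and function names are made up. Keep in mind that a port number is only a hint, since any service can run on any port.

```php
<?php
// A small selection of well-known default ports
const COMMON_PORTS = [
    21    => 'FTP',
    22    => 'SSH',
    25    => 'SMTP',
    80    => 'HTTP',
    110   => 'POP3',
    143   => 'IMAP',
    443   => 'HTTPS',
    587   => 'SMTP (submission)',
    3306  => 'MySQL',
    5432  => 'PostgreSQL',
    25565 => 'Minecraft',
];

/** Maps a list of open ports to the services typically found there. */
function servicesFor(array $openPorts): array
{
    $services = [];
    foreach ($openPorts as $port) {
        $services[$port] = COMMON_PORTS[$port] ?? 'unknown';
    }
    return $services;
}

print_r(servicesFor([22, 80, 443, 3306, 8880]));
// 22 => SSH, 80 => HTTP, 443 => HTTPS, 3306 => MySQL, 8880 => unknown
```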

HTTP vs. HTTPS

Web content is usually delivered on ports 80 (HTTP) or 443 (HTTPS).

A few years ago, Google pushed website operators to switch to the more secure HTTPS. Many did so at the time, fearing lower rankings, and switched to HTTPS. Major browser manufacturers also introduced measures to encourage website operators to make the switch. Connections were marked as "insecure" or some browser features were only available for HTTPS pages.

Therefore, it is not surprising that 88 % of .at websites can be reached via HTTPS and HTTP. 7 % of the checked pages are only accessible on HTTP (Port 80), and only 3 % of the pages are configured for HTTPS-only.

Percentage of open HTTP(s) ports

It is possible for web content to be delivered on ports other than 80 and 443. The most common ports are variants of 80 and 443, such as 8080, 8000, 8081, or 9443. Port 3000 is also often used in Node.js environments.

Such ports are sometimes used to operate a development or test environment alongside the actual site. The integration of 3rd-party systems into the network landscape is also achieved via alternative ports.

12 % of all checked IPs deliver web content on port 8880, 9 % on 8080, and 2 % on 8081. Upon further inspection, it becomes apparent that 8880 is almost always a Plesk login. On port 3000, you often encounter the Grafana statistics software. Other ports deliver error messages or blank pages or are subsequently blocked by a firewall (e.g., FortiGuard).

Occasionally, you'll come across login pages for other web-based applications: i-MSCP, ISPConfig, Kibana, Roundcube, ... Is this a problem? Yes, no, maybe! It depends on how secure the passwords used are and whether the respective software is regularly updated.

Note: Now it makes sense to take a short detour to the National Vulnerability Database. There, you can find current and past security vulnerabilities, searchable by software and exact version. Here's an example for "Roundcube." If you find vulnerabilities for the software and version there, then it's a problem.

Screenshot NIST NVD

Databases and Data-Stores

In most cases, databases should not be directly connected to the internet because this creates an attack surface that can easily be avoided. In some cases, however, it is unavoidable, and then the database in use can be determined via port scanning. In the checked IPs, there are about 22,000 open database ports, and the vast majority of them, 18,091 to be exact, are MySQL databases.

All popular databases (data-stores) have defined default ports that are usually not changed. However, some databases use very generic ports (8443, 9443, or 8080) that are not considered in this evaluation because these ports would not provide meaningful clues.
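
Mapping open ports to database products is essentially a table lookup. The following Python sketch shows the idea; the selection of ports is my own assumption (well-known defaults only), not the exact list used for the evaluation, and the IP addresses are documentation placeholders.

```python
from collections import Counter

# Well-known default ports; this selection is an assumption for illustration.
DEFAULT_DB_PORTS = {
    3306: "MySQL/MariaDB",
    5432: "PostgreSQL",
    1433: "Microsoft SQL Server",
    27017: "MongoDB",
    6379: "Redis",
    9200: "Elasticsearch",
}

def count_databases(scan):
    """Count, per database product, how many IPs expose its default port."""
    counts = Counter()
    for ports in scan.values():
        for port in ports:
            name = DEFAULT_DB_PORTS.get(port)
            if name:
                counts[name] += 1
    return counts

# Hypothetical scan result: IP -> set of open ports
sample_scan = {
    "192.0.2.1": {80, 3306},
    "192.0.2.2": {3306, 5432},
    "192.0.2.3": {443},
}
db_counts = count_databases(sample_scan)
```

An open default port is only a hint, not proof: any other service could be listening on that port.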

Number of IPs with open DB ports

Web-Application Server Ports

A web application server provides an environment in which web applications can be executed. This includes the provision of runtime environments, frameworks, database connections, security features, and other services required for the development and operation of web applications. The used web app servers are often recognizable from the outside by their default ports.

Number of IPs with evidence of app servers

Ports Encore - Minecraft Server

We have to answer the most important question at the end of the chapter 😜. Are there IPs with open Minecraft ports in the dataset? Yes. 191 IP addresses have an open port 25565, indicating an active Minecraft server.

Minecraft Screenshot

And can you connect? The first two servers had a different version, and for the next two, I wasn't on the player whitelist, but on the 5th server on the list, I was able to log in, even though the server and the IP address were not known as a Minecraft server anywhere on the Internet.

DNS Records

DNS records connect IP addresses with domain names. In simplified terms, a DNS server is a large database that contains different types of entries and is mirrored on many other servers on the Internet.

Sources And Research

In the IP addresses section, this database was already used to determine IP addresses. However, DNS records can reveal even more interesting things. DNS entries can be read with various tools. The analyses in this section were all done with the command-line tool dig, saved, and then parsed with PHP. The dataset comprises 6,288,955 records of 888,450 domains and is 347 MB in size.

For security reasons, some name servers do not respond to so-called ANY queries, which return all records of every type. Therefore, in the first step, the 7 most important entry types (NS, TXT, A, AAAA, CNAME, MX, SOA) were individually queried.

To obtain the most comprehensive picture possible, in the second step, ANY queries were sent to all domains, and the analyses related to this are presented in the "DNS ANY queries" section.

Addition: Retrieving DNS records with dig

The command-line program dig can be used to make queries to nameservers. Here, all records (ANY) for orf.at are requested from the nameserver with the IP address 8.8.8.8 (Google).

dig +noadditional +noquestion +nocomments +nocmd +nostats orf.at ANY @8.8.8.8
Screenshot dig Example
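
With the options shown above, dig prints one record per line (name, TTL, class, type, value). The original parsing was done in PHP; the following is only a rough Python sketch of how such lines could be split into fields. The sample output and its values are made up for illustration.

```python
def parse_dig_answer(text):
    """Parse dig answer lines into dicts with name, ttl, class, type, value."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines and dig comment lines (they start with ';')
        if not line or line.startswith(";"):
            continue
        parts = line.split(None, 4)
        if len(parts) < 5 or not parts[1].isdigit():
            continue
        name, ttl, rr_class, rr_type, value = parts
        records.append({
            "name": name.rstrip("."),
            "ttl": int(ttl),
            "class": rr_class,
            "type": rr_type,
            "value": value,
        })
    return records

# Hypothetical dig output (values are placeholders, not real records)
sample_output = """\
example.at.   3600  IN  A    192.0.2.10
example.at.   3600  IN  MX   10 mail.example.at.
example.at.   3600  IN  TXT  "v=spf1 mx -all"
"""
records = parse_dig_answer(sample_output)
```

Splitting at most four times keeps values that contain spaces (MX priorities, TXT strings) in one piece.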

Record Types

First of all, an overview of the types and frequency of DNS records. Common types include NS, MX, A, AAAA, TXT, CNAME, and SOA records.

DNS records by type

DNS records by host/type

Each website needs an NS, A, and SOA record; otherwise, it would not work. 853,055 domains have entries of these three types. As can be seen in the chart on the left, some types of entries (e.g., NS) are assigned multiple times to increase fault tolerance.

AAAA records are IPv6 addresses. For many years, there have been warnings that all IPv4 addresses have already been assigned and that everyone should switch to IPv6. Apparently, that has not happened yet.

MX Records - The More, The Better?

The Mail Exchange (MX) DNS records indicate which mail server is responsible for the domain. This can provide insights into the software and cloud providers used.

I would have expected most domains to have at least two or more MX records. This would increase fault tolerance and would hardly require any extra effort. As the chart on the left shows, this is not the case, and there is a reason for it.

The individual MX records point to subdomains (e.g., xyz.mail.protection.outlook.com) that have multiple IP addresses (A records) assigned to them. If we include this in the calculation, the chart (on the right) looks somewhat different.

Count MX records
by domain

Count MX records by domain
(Multi IPs included)

If you are building airplanes and want to make them really safe, you can simply create 12 MX records.

Screenshot Boeing DNS records

741,491 domains have at least one MX record and can theoretically receive emails. The top 10 mail servers include major hosting companies and cloud providers.

Mail Server Vendors

Domains that point to outlook.com use a Microsoft 365 (formerly Office 365) product. MX records that contain "aspmx.l.google.com" indicate a Google Workspace product.

Mail server hosts

When looking at the distribution of domains among mail servers, it becomes apparent that the top 15 are responsible for more than 5,000 domains each. After that, there are around 60 mail providers that manage emails for 500 to 5,000 domains. 7,814 smaller hosts or larger companies are responsible for 2 to 500 domains. The majority of mail servers, namely 282,377, are responsible for exactly one domain.

Domains by mail server

External vs. Internal MX domains: If the domain matches the mail server domain, it is likely that a local mail server is responsible for receiving emails. Different domains can indicate the use of a cloud provider or the hoster operating a centralized mail service.
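
A simple way to approximate this distinction is a suffix comparison between the domain and the MX host name. The Python sketch below is an assumption about how such a check could look, not the exact rule used for the chart.

```python
def mx_location(domain, mx_host):
    """Classify an MX host as 'internal' (same domain) or 'external'."""
    domain = domain.lower().rstrip(".")
    mx_host = mx_host.lower().rstrip(".")
    if mx_host == domain or mx_host.endswith("." + domain):
        return "internal"
    return "external"
```

For example, mx_location("example.at", "mail.example.at.") is classified as internal, while an MX host under mail.protection.outlook.com is classified as external. Requiring the leading dot in the suffix check avoids treating "notexample.at" as part of "example.at".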

Mail servers external/internal

Note: When looking through the MX list, it can be seen that around 200 domains have an MX record from the Russian cloud providers Yandex or mail.ru. The Chinese mail providers qq.com and 163.com are only used by 6 .at domains.

TXT Records. More in, for you!

In TXT records, you can write anything you want, as long as a single string doesn't exceed 255 characters. If you want to pack more text into DNS, just create multiple TXT records. In my scans, I found 560,400 TXT records for 417,348 different domains.
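
The 255-character limit can be illustrated with a tiny helper that cuts a long value into pieces of at most 255 characters. This is a Python sketch that assumes plain ASCII text; the function name is made up for the example.

```python
def split_txt(value, limit=255):
    """Split a long text into chunks of at most `limit` characters."""
    return [value[i:i + limit] for i in range(0, len(value), limit)]

# A 600-character value needs three chunks
chunks = split_txt("a" * 600)
chunk_lengths = [len(chunk) for chunk in chunks]
```

Joining the chunks again in order restores the original value.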

Domain count with n TXT records

Most domains have exactly one TXT record, while some have as many as 50 records. I mean, "Sparkasse, quo vadis?" (where are you going?) I hope that's intentional.

Screenshot Sparkasse DNS records

In recent years, TXT records have been increasingly used to verify ownership of a domain. When a domain is to be used with a cloud provider, the owner is asked to create a special TXT record so that the provider knows the domain really belongs to the customer.

By the type of record, it is often possible to tell which domain is working with which SaaS provider. Providers such as Google, Facebook, Apple, and Zoho use this type of domain verification.

Verify TXT records

Based on the Verify TXT record alone, one cannot tell which specific product is being used by each provider. One can only say in general that 53k domains are using something from Microsoft and 44k are using some product from Google.
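
Detecting such verification records boils down to matching well-known prefixes. Below is a minimal Python sketch; the prefix table only covers a handful of providers and is my own selection for illustration, not the full list behind the chart.

```python
# Lowercased verification prefixes; a small, illustrative selection.
VERIFY_PREFIXES = {
    "google-site-verification=": "Google",
    "ms=": "Microsoft",
    "facebook-domain-verification=": "Facebook",
    "apple-domain-verification=": "Apple",
    "zoho-verification=": "Zoho",
}

def detect_providers(txt_records):
    """Return the set of providers whose verification prefix appears."""
    found = set()
    for record in txt_records:
        value = record.strip().strip('"').lower()
        for prefix, provider in VERIFY_PREFIXES.items():
            if value.startswith(prefix):
                found.add(provider)
    return found

# Hypothetical TXT values of one domain
sample_records = [
    '"google-site-verification=abc123"',
    "MS=ms12345678",
    "v=spf1 -all",
]
providers = detect_providers(sample_records)
```

As noted above, a match only says that some product of the provider is in use, not which one.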

Many other providers use TXT records to store configurations or verifications for specific products. During a brief review, I found more than 70 providers:

Amazon SES, BMD, Barracuda, Brave, Cisco, Citrix, Cloudflare, ClubDesk, DigiCert, Docker, Drift, DUO, Dynatrace, Elastic Email, Firebase, Fortinet, Freshdesk, GitHub, GitLab, GlobalSign, HIBP, Hornetsecurity, HubSpot, IBM, Indeed, Infoniqa, Jimdo, KnowBe4, MIDOCO, MS Dynamics, MS Office365, Mailjet, Mailru, Mandrill, Microsec, Mimecast, Miro, MongoDB, nameshield, Offensity, OneTrust, Oracle, Pardot, Plesk, Postman, Protonmail, Rexx, SAP, Salesforce, SendGrid, Sendinblue, Seobility, Shopify, Sipgate, Smartsheet, Sophos, Spycloud, Squarespace, Stripe, TITAN, TOPDesk, Trend Micro, Trustpilot, Webex, Webflow, Wix, Wordpress, Workplace, Wrike, Yandex, Zendesk, Zoom, blackscreen, dan.com, eRecruiter, flexera, iCIMS, mailEnable, proofpoint, sevDesk, site24x7, successfactors, x-mailer

TXT Records for SPF and DMARC

Many of the providers and products can be found in the SPF information. SPF stands for "Sender Policy Framework" and is designed to make email communication more secure. An SPF record lists all the mail servers that are allowed to send emails on behalf of the domain. Therefore, many marketing, support, and sales systems can be found in the values.

380,332 domains are secured by the Sender Policy Framework. Another protocol that is designed to improve email security is DMARC, but it is only used by 752 domains.
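
Third-party senders usually show up in SPF records as "include:" mechanisms. The following Python sketch extracts them from a TXT value; it is a simplified illustration (it ignores redirects, macros, and nested lookups), and the sample record is invented.

```python
def spf_includes(txt_value):
    """Return the domains referenced via include: in an SPF record."""
    terms = txt_value.strip().strip('"').split()
    if not terms or terms[0].lower() != "v=spf1":
        return []  # not an SPF record
    includes = []
    for term in terms[1:]:
        term = term.lstrip("+-~?")  # drop an optional qualifier
        if term.lower().startswith("include:"):
            includes.append(term[len("include:"):].lower())
    return includes

# Hypothetical SPF record
sample_spf = '"v=spf1 mx include:spf.protection.outlook.com include:_spf.google.com ~all"'
included = spf_includes(sample_spf)
```

Counting the included domains across all SPF records would give a rough picture of which mail and marketing services are authorized to send on behalf of .at domains.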

TXT records for SPF/DMARC

TXT Records: Specials

TXT - "domain blocked": I found the TXT entry "domain blocked" in 10,755 domains. Checking some samples showed that all these domains appear in the nic.at Whois with the status "pendingDelete". Therefore, these could be TXT entries from nic.at for domains that do not comply with the rules.

TXT with advertising: You can always be pretty sure that wherever there is a text field, someone will come up with the idea of filling it with advertising. Creativity or desperation?

IBM ads in DNS

DNS ANY Requests: Everything at Once

The previous analyses have focused on specific DNS types. ANY queries are not supported by all servers because they are complex and slow (see Cloudflare Blog). However, what do you get when you send ANY queries to the name servers? Surprisingly, many servers respond, and types such as CAA, HINFO, RRSIG, DNSKEY, and DS are widespread among .at domains.

SPF - "You're Doing it Wrong!": 0.5 % of the domains have entries of type "SPF", which are actually wrong because SPF entries must have the type "TXT" and should be changed accordingly. Although the type existed at some point, it was revoked in 2014 and is no longer supported by many servers (see Wikipedia).

HINFO: By indicating RFC8482 or "ANY not supported" in the HINFO entry, it is shown that no ANY queries are answered by the server. Currently, this applies to 6 % of the queried domains.

DNSSEC entries: DNSSEC stands for Domain Name System Security Extensions and is intended to ensure the authenticity and integrity of DNS entries through digital signatures. There are different types of entries that all have a specific role in DNSSEC: DNSKEY, RRSIG, NSEC, NSEC3PARAM, DS. About 3 % of .at domains use DNSSEC.

CAA entries: CAA entries were specified for 1 % of domains, indicating which Certificate Authorities are authorized to issue certificates for this domain. Typically, providers such as geotrust.com, letsencrypt.org, or digicert.com are listed in this record.

HTTP-Header & Cookies

HTTP headers are pieces of information sent from a web server to a client (such as a web browser) to provide additional details about the transferred data. This information can include various things such as the type of content, response size, cache instructions, and more.

Out of the 1.3 million domains initially accessed, 888,450 provided a response and a total of 6,288,955 HTTP response headers were stored. An overview of HTTP headers can be found on MDN, Wikipedia, or at OWASP.

HTTP Server Header. Who is serving?

The server header is a response header that is optionally sent by web servers to identify the name and version number of the web server or server software being used. The server header can potentially reveal sensitive information that attackers can exploit to find vulnerabilities.

Apache is still almost twice as large as Nginx: Apache delivers 305,249 websites, while Nginx is responsible for only 160,585. In total, 555,428 websites returned one of 610 different server headers.
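
To group server headers by product, the version and platform details have to be stripped first. Here is a small Python sketch of one possible normalization; the regular expression and the sample headers are assumptions for illustration.

```python
import re
from collections import Counter

def server_product(header_value):
    """Reduce a Server header like 'Apache/2.4.41 (Ubuntu)' to 'apache'."""
    match = re.match(r"\s*([^/\s(]+)", header_value)
    return match.group(1).lower() if match else ""

# Hypothetical Server header values
sample_headers = [
    "Apache/2.4.41 (Ubuntu)",
    "Apache",
    "nginx/1.18.0",
    "Microsoft-IIS/10.0",
]
product_counts = Counter(server_product(h) for h in sample_headers)
```

Everything up to the first slash, space, or opening parenthesis is treated as the product name.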

HTTP server header

HTTP Powered-By Header. Energize!

The HTTP header "X-Powered-By" is an optional HTTP response header field sent by web servers to identify technologies being used. The X-Powered-By header may contain information such as programming language, web server software, database software, and other technologies being used.

22.8 % of homepages send an X-Powered-By header. 475 different headers were stored, with the TOP 10 led by PHP, Plesk, and ASP.

Pages with X-Powered-By

HTTP X-Powered-By Header

HTTP CSP (Content Security Policy) headers are a mechanism for protecting web applications from cross-site scripting (XSS) and other attacks. By defining trusted sources, the injection of malicious data into the web page is intended to be prevented.

Only 0.7 % of websites send a CSP header. This seemed low to me until I remembered that CSP can also be defined in HTML using a <meta> tag. Therefore, the numbers from the CSP HTML evaluation were integrated here. After that, it was a "whopping" 0.9 % 🙄.

Domains with/without CSP

Website Age. Old but gold?

138,221 websites provided both a Date and a Last-Modified header. The calculated age of the content shows that there are many websites that update daily, but also many whose content has not changed for over two years.
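
The age calculation is simply the difference between the two header timestamps. A minimal Python sketch, assuming both headers use the standard HTTP date format:

```python
from email.utils import parsedate_to_datetime

def content_age_days(date_header, last_modified_header):
    """Return the whole days between Last-Modified and Date."""
    served = parsedate_to_datetime(date_header)
    modified = parsedate_to_datetime(last_modified_header)
    return (served - modified).days

# Hypothetical header values
age = content_age_days(
    "Mon, 10 Apr 2023 12:00:00 GMT",
    "Sat, 10 Apr 2021 12:00:00 GMT",
)
```

Using the Date header of the response instead of the crawler's own clock avoids errors from clock differences between client and server.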

Domains with/without Age

Age in days

Cookies. Cookies. Cookies.

Cookies are data stored by a website on a user's computer or mobile device when they visit the site. Cookies contain information about activities on the page or specific settings.

Cookies have had a "small" image problem for a few years because they have been used for things that were not intended when they were invented. That's why website operators must ask users beforehand if they can set cookies, at least for all cookies that are not technically necessary. Of the 578,385 websites crawled, 29.4% set cookies on page load (without asking). Hopefully, only technically necessary ones?!

Pages with cookies

There are pages that use cookies as a kind of "database" and store all kinds of stuff in them. That's why the top performers in the statistics set a double-digit number of cookies as soon as you open the page. 😳

TOP 10 - Cookie count on page

The most commonly used cookies are typical session cookies (e.g., PHPSESSID, beng_proxy_session, ...), cookies for security-related features (e.g., XSRF-TOKEN), and cookies that store settings (e.g., localization, pll_language).
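
A ranking of cookie names can be derived from the stored Set-Cookie headers. The Python sketch below counts each name once per page; the input structure (page mapped to a list of Set-Cookie header values) and the sample data are assumptions.

```python
from collections import Counter

def cookie_name(set_cookie_header):
    """Extract the cookie name from a Set-Cookie header value."""
    first_pair = set_cookie_header.split(";", 1)[0]
    name, _, _ = first_pair.partition("=")
    return name.strip()

def top_cookie_names(pages, n=10):
    """Return the n most common cookie names, counted once per page."""
    counts = Counter()
    for headers in pages.values():
        counts.update({cookie_name(h) for h in headers})
    return counts.most_common(n)

# Hypothetical crawl data: page -> Set-Cookie header values
sample_pages = {
    "a.example.at": ["PHPSESSID=abc; path=/; HttpOnly", "XSRF-TOKEN=xyz; path=/"],
    "b.example.at": ["PHPSESSID=def; path=/"],
    "c.example.at": [],
}
top_names = top_cookie_names(sample_pages, 2)
```

Counting per page rather than per header keeps a single page that sets the same cookie several times from distorting the ranking.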

TOP 10 - Cookie names

Unfortunately, there are also cookies that probably do not fall under the "technically necessary" category. Cookie names such as facebookPixel, remarketing_cid, SC_ANALYTICS_GLOBAL_COOKIE, ad_storage, gtm, trackings, ... already sound suspiciously like advertising, analytics, or tracking. Fortunately, they are only found occasionally.

Website Speed: Need for Speed

When downloading the homepages, the time required to download them was measured. Only 2% could be downloaded within 100 milliseconds. 22.8% required between 100 and 249 ms. For the largest proportion of pages (39.2%), downloading took between 250 ms and 500 ms. For 16.3%, downloading took longer than one second to complete.
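
Sorting download times into such ranges is a classic bucketing task. The following Python sketch assumes whole milliseconds; the "501-1000 ms" bucket is my assumption for the range that is not named explicitly above.

```python
import bisect

# Exclusive upper edges of the first four buckets, in milliseconds
BOUNDS = [100, 250, 501, 1001]
LABELS = ["<100 ms", "100-249 ms", "250-500 ms", "501-1000 ms", ">1000 ms"]

def speed_bucket(ms):
    """Return the label of the bucket a download time (in ms) falls into."""
    return LABELS[bisect.bisect_right(BOUNDS, ms)]
```

bisect_right returns the number of edges that are less than or equal to the value, which is exactly the index of the matching label.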

Page speed

More than 500 ms is already relatively slow for a website. One must keep in mind that only the download of the HTML is considered here and the time is measured until the HTML is completely received by the browser. Stylesheets, images, and scripts still need to be loaded and executed there.

HTML Structure And Tags

HTML stands for "Hypertext Markup Language" and is the markup language used to create web pages. It is used to define the content and structure of a web page by including various elements like headings, text, images, and links, which are interpreted and displayed by a web browser.

HTML consists of a set of tags (markup elements) that tell browsers how to display the content. For example, the <h1> tag can be used to create a first-level heading, while the <p> tag is used to define a paragraph.

In April 2023, I attempted to download 1.3 million homepages. Before downloading, a check was made on port 80 or 443, and only responses without errors were considered (HTTP status code: 200). As a result, I was able to save the HTML of 578,385 homepages and analyze them afterward.

Details: Crawling with crwlrsoft/crawler

With the GitHub package crwlrsoft/crawler from crwl.io, you can easily and quickly develop web crawlers that download entire web pages or only parts of them.

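use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;

// Assumed setup: a crawler with a bot user agent (the name is a placeholder)
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');
// Load the start page and follow its links one level deep
$crawler->input('https://www.orf.at/')
    ->addStep(Http::crawl()->depth(1));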

HTML Size

The proud and surprising winner of the size comparison is a website with just over 30 MB. Please note, this is only HTML and no images, JS, CSS, or anything else. One might think that there is a lot of meaningful content on it, but unfortunately, that's not the case. The page was obviously created with Microsoft Word and exported to HTML, containing a lot of invisible code.

Screenshot 30 MB of HTML

Fortunately, such page monsters are rather rare (only 3,406 or 0.58 % have more than one megabyte). Just over 50 % of the 577,552 HTML responses are between 1 and 50 kilobytes, and another 34 % are between 50 and 256 kilobytes.

Distribution of HTML sizes

HTML Tags vs. Content

How big is the proportion of content to markup, actually? If you remove all HTML tags from the markup, you get the pure content of the page. More than 85 % of websites have a content proportion of up to 60 %. Only just over 15 % of pages have a content proportion of more than 60 %.
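
The content proportion can be approximated as the length of the visible text divided by the length of the full HTML. Below is a Python sketch using only the standard library; skipping script and style content and collapsing whitespace are my assumptions, not necessarily what was done for the chart.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def content_ratio(html):
    """Return the share of text characters in the HTML (0.0 to 1.0)."""
    if not html:
        return 0.0
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation does not count as content
    text = " ".join(" ".join(parser.parts).split())
    return len(text) / len(html)

sample_html = "<html><body><p>Hello</p></body></html>"
ratio = content_ratio(sample_html)
```

In the sample, 5 of 38 characters are content, so the ratio is about 13 %.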

What is the content proportion?

HTML-Tags per page

The maximum number of HTML tags on a page is 271,726. This page is 21 megabytes in size and is managed with Wordpress. The commenting function is set up so that anyone can comment, resulting in 19,537 spam comments on the page. 🤕

Top-10 HTML-Tags

DIV does not stand for "diverse", but it is used for that purpose. It is not surprising that there are so many <div> tags in the HTML. But links (<a> tags) in second place and <script> and <link> tags in the TOP 10 are quite interesting.

HTML Tags For SEO

SEO stands for "Search Engine Optimization" and refers to the practice of designing and optimizing websites to appear as high as possible in search results.

There are specific recommendations for how a well-optimized SEO page should be designed. First, the HTML should be "well-formed", which means there should be no errors in the markup. You can easily check this with HTML Tidy. Only 6.9 % of the pages are error-free, 84.1 % have warnings (small errors), and 9 % of the pages have major HTML errors.

HTML errors and warnings

HTML with SEO tags

A few basic things should be found on an SEO-optimized page. 21.7 % of the tested pages contain a DOCTYPE, a <title>, a <meta> description, exactly one <h1>, and at least one <a> tag.

Addition: Tags and quality with PHP

HTML Tidy can be used directly in PHP, as there is a native PHP extension for it.

# sudo apt-get install php8.2-tidy

$tidy = tidy_parse_string($htmlContent);
$tidy->cleanRepair();
$tidy->diagnose();

var_dump($tidy->errorBuffer);

I prefer to work with HTML using the Symfony DomCrawler extension.

# composer require symfony/dom-crawler
# composer require symfony/css-selector

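use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\CssSelector\CssSelectorConverter;

// Converts CSS selectors into XPath expressions
$converter = new CssSelectorConverter();

$crawler = new Crawler($html);
// Count all <title> elements in the document
$titles = $crawler->filterXPath($converter->toXPath('title'));
var_dump(count($titles));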

Unlike non-semantic tags like <div> or <span>, which are only used to structure sections of the webpage or group elements, semantic tags provide specific information about what type of content is in the tag.

Commonly used semantic tags include: <header>, <footer>, <main>, <aside>, <nav>, and <section>. 71% of the tested web pages use semantic block-level HTML tags.

HTML with semantic tags

schema.org in HTML

Structured data in HTML are a standardized format for displaying information about the content of a web page. These metadata are used by search engines, social media, and other web services to better understand the content and context of a web page.

Screenshot of schema.org

On schema.org, a collection of metadata specifications (called "schemas") is published that developers can use to make web pages more machine-readable. The most commonly used elements relate to technical details such as EntryPoint, SearchAction, BreadcrumbList, ... or the general indication that it is a WebSite or WebPage.
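
One common way to embed schema.org data is a JSON-LD block inside a <script type="application/ld+json"> element. The Python sketch below counts the "@type" values found in such blocks. It only covers JSON-LD (not Microdata or RDFa), which is an assumption for this example, and the sample markup is invented.

```python
import json
from collections import Counter
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collect the raw contents of all JSON-LD script blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

def collect_types(node, counts):
    """Recursively count every @type value in a decoded JSON-LD structure."""
    if isinstance(node, dict):
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if isinstance(types, list):
            counts.update(t for t in types if isinstance(t, str))
        for value in node.values():
            collect_types(value, counts)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, counts)

def schema_types(html):
    """Return a Counter of schema.org types found in JSON-LD blocks."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    counts = Counter()
    for block in collector.blocks:
        try:
            collect_types(json.loads(block), counts)
        except json.JSONDecodeError:
            continue  # ignore malformed blocks
    return counts

sample_html = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebSite", "url": "https://example.at/",
 "potentialAction": {"@type": "SearchAction", "target": {"@type": "EntryPoint"}}}
</script></head></html>"""
types = schema_types(sample_html)
```

The sample yields one WebSite, one SearchAction, and one EntryPoint, which matches the kind of technical types mentioned above.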

TOP 10 schema properties

Inbound / Outbound Links

Links are still an important ranking factor for all major search engines. Who is linking to whom and how often? Within the .at domains, the most linked pages are herold.at, google.at, and wko.at.

TOP 10 linked pages

Most of the list is not surprising, but why is the Federal Chancellery (bka.gv.at) so well-linked? The Legal Information System can be reached under the subdomain ris.bka.gv.at, and many companies link in the footer to the current Trade Act, which can be found in the RIS. Mystery solved!

Screenshot of Rechtsinformationssystem (RIS)

The highest number of outgoing links to different .at domains is found on the page: museen-in-oesterreich.at. There are 561 .at domains linked there and a total of 1,551 external links.

Outbound links histogram

<img> Tags. Pictures in your head.

The <img> tag is used to insert an image into a web page. The tag has a required attribute "src" which specifies the URL of the image to be inserted. I found 8.7 million images on 489k pages, of which 2 million had an ALT attribute.

<img> total: 8,745,391
Pages with <img>: 489,633
<img> with ALT: 2,090,312
Max. on one page: 27,377

One page even had 27,377 images embedded, and because no one would believe it, I have a proof screenshot here. It's not as bad as it seems at first: the images are lazy-loading, and the page is "okay" fast. Still not optimal.

Screenshot of the 27k website

The "alt" attribute is an important attribute for images. It indicates what is shown in an image if the image cannot be displayed for any reason and also serves to improve the website's accessibility for people with visual impairments.

Do important things appear in "alt" attributes? "Logo" is the word that appears by far the most in "alt" texts. You can find an image with "Logo" alt text on 128,494 pages. I particularly like "alt" texts that contain "Image" (10,685 pages), "Bild" (5,496 pages), "Foto" (4,128 pages), or "Icon" (9,370 pages). Not.

Often used words describe either menu elements or social media links or sharing buttons. Below is a small overview of these two categories.

Navigation Icons
menu, menü, menue, menu-icon, mobile-menu, submenu, ...: 62,508
home, haus, homepage, ...: 16,864
arrow, pfeil, arrow-right, arrow-left, pfeilchen, abwärtspfeil, previous, next, richtungspfeil, pfeil-icon, navigationspfeil, pfeillinks, pfeilrechts, ...: 13,001
icon, icons, ...: 9,662
burger, burger-menu, bento, ...: 866

Social Media Icons
facebook, facebookicon, facebook_pixel, facebook-logo, ...: 18,602
instagram, instagramicon, instagramm, ...: 9,934
youtube, youtubeicon, social-logo-youtube, ...: 8,178
linkedin, linkedin-logo, ...: 3,603
tiktok, tiktok-logo, #tiktok, ...: 498

Google Tags - "Lawyers Love This Trick"

If you are a lawyer who sends out data protection warning letters and you are looking for a source of income, if disciplinary proceedings at the bar association do not deter you, and if you want to stand trial for serious fraud, then you can "surf" the 95,102 websites with Google Fonts yourself and issue warnings. 🤡 ... and please don't forget the 130 .gv.at pages.

Startpages with Google tags

Data Export with <script> or <link>

By now, everyone knows that Google Fonts are "warnable". However, the problem with external links that send private IP addresses to the USA or other countries is bigger and affects virtually all external resources that are included using <script> or <link> without user consent.

If you can find 95k pages with Google Fonts, how many pages actually include external resources from other domains? I was able to find a total of 379k domains (65 %) that use external scripts or stylesheets and thus potentially transfer personal data to third parties.
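
Finding external resources means collecting the URLs of <script src> and stylesheet <link href> elements and comparing their host with the host of the page. The Python sketch below shows one possible approach; it compares full host names, so a resource on another subdomain of the same site would also count as external, which is a simplification. The sample markup is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class ResourceCollector(HTMLParser):
    """Collect URLs of scripts and stylesheets referenced in the HTML."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.urls.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            if "stylesheet" in (attrs.get("rel") or "").lower():
                self.urls.append(attrs["href"])

def external_hosts(page_url, html):
    """Return the set of foreign hosts that scripts/stylesheets load from."""
    collector = ResourceCollector()
    collector.feed(html)
    collector.close()
    page_host = urlparse(page_url).hostname
    hosts = set()
    for url in collector.urls:
        # Resolve relative and protocol-relative URLs against the page URL
        host = urlparse(urljoin(page_url, url)).hostname
        if host and host != page_host:
            hosts.add(host)
    return hosts

sample_html = """
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto">
<link rel="stylesheet" href="/css/site.css">
<script src="//www.googletagmanager.com/gtag/js?id=X"></script>
<script src="js/app.js"></script>
"""
hosts = external_hosts("https://www.example.at/", sample_html)
```

For a ranking by provider, the host names would additionally have to be reduced to their registrable domains (e.g., fonts.googleapis.com to googleapis.com).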

TOP 20 external resources domains

  • [1] googleapis.com, googletagmanager.com, gstatic.com, google.com, google-analytics.com
  • [2] jimstatic.com, jimcdn.com
  • [3] parastorage.com, wixstatic.com
  • [4] wp.com, wp.me

Resources (scripts or stylesheets) from all Google domains together are embedded on over 221k websites. These include Google Fonts, Google Analytics, and Google Tag Manager, for example. The content management systems Jimdo, WIX, and WordPress are in the "Top Ten". World4You (place 3) appears again here because so many .at domains are parked there.

Website Content

In this section, all analyses that were conducted based on website content but are not specifically related to HTML or other technologies are included.

Content is King.

The word count, after removing HTML, scripts, and styles, ranges from 0 to 600,000 words (... again, the Wordpress page with open comments 😳). On 29% of websites, fewer than 100 words are written. Some of these probably load content using JavaScript, and the others are placeholder pages or pages of people who like to keep it brief.

Word count on websites

Legal Notice on Website

In Austria, the disclosure requirement for electronic media under the Media Act (commonly referred to as an imprint obligation) has existed for several years. Since both commercial and private websites require an imprint, one could assume that the word "Impressum" (or an English equivalent) appears somewhere on every page?

Websites with/without legal notice

I was able to find the word "impressum," "Impressum," "Legal Notice," or "Legal Disclosure" on 64.67% of websites. Now we still need to account for pages that load all content through JavaScript and websites in languages other than German or English. Speaking of languages, what languages are found on Austrian websites?

Languages

With the PHP package patrickschur/language-detection, languages can be detected. I shortened the contents of the websites to 50 words and then performed language recognition. Not surprisingly, 70% are in German, for 20%, recognition was not possible, and 8.5% are in English. Among the other 1.5% of languages, mainly languages from neighboring countries can be found.

Website language


TOP 10 - Other: Czech (680), Vietnamese* (279), Polish (219), Hungarian (184), Slovak (176), Russian (133), Swedish (90), Slovenian (84), Turkish (80), Serbian (72)

* Addendum: Vietnamese .at pages?

279 pages in Vietnamese? That seemed fishy to me. And it is. All of these pages deliver the same content (see screenshot), which comes from a landing page platform in Vietnam. The domains belong to a domain reseller in Germany. Many of the pages are marked as phishing sites by Google Chrome.

Chrome Screenshot

Something is rotten in the state of Denmark. Currently, the pages seem to be harmless, but that can change at any time if new content is served.

Gender-inclusive language

Gender-sensitive language is still not really common on Austrian websites. If you search the 345,105 pages with more than 200 words for typical gendering forms, you will only find them on 14 % of the homepages. The search was conducted for the recommended/common forms with "binnen-I" (In, Innen), asterisk (*in, *innen), colon (:in, :innen), slash (/in, /innen), and underscore (_in, _innen).
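
The five forms can be searched for with regular expressions. The patterns in this Python sketch are my own approximation of the described search, not the ones actually used; as with the real analysis, the Binnen-I pattern produces false positives such as "LinkedIn".

```python
import re

WORD = r"[A-Za-zÄÖÜäöüß]+"

# One pattern per gendering form; each matches the singular and plural suffix
PATTERNS = {
    "binnen-I": re.compile(rf"\b{WORD}[a-zäöüß]In(?:nen)?\b"),
    "asterisk": re.compile(rf"\b{WORD}\*in(?:nen)?\b"),
    "colon": re.compile(rf"\b{WORD}:in(?:nen)?\b"),
    "slash": re.compile(rf"\b{WORD}/in(?:nen)?\b"),
    "underscore": re.compile(rf"\b{WORD}_in(?:nen)?\b"),
}

def gender_forms(text):
    """Return the set of gendering forms that occur in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}

forms = gender_forms("Unsere MitarbeiterInnen und Kund:innen")
```

A sentence like "Wir wohnen in Berlin" matches none of the patterns, because a lowercase "in" without a separator is not treated as a gender suffix.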

Gendering on websites

Regarding gendered forms, "binnen I" is the most popular at 8 %, followed by colon (4 %), asterisk (4 %), and trailing behind are slash (2 %) and underscore (0.3 %). The most commonly gendered words are: Mitarbeiter (employee), Kunde (customer/client), Schüler (student), Teilnehmer (participant), and Patient (patient).

Gender-Forms

Gendered words

* For the singular "binnen I" form (In), I had to do some manual post-editing and remove all obviously incorrect words, such as LinkedIn, LogIn, CheckIn, etc.

The list of gendered words contains 11,433 different words. In addition to many common terms, you can also find words that are not used so often: Trickdogtrainer/in, Corona-Verharmloser*innen, Woidarbeiter*innen, Qualitätsröster/innen, DownhillerInnen, Clown*innen, Hackbrett-Künstler/in, Wildtierschmuggler:innen, ViewerInnen, ...

Summary

I hope I was able to show through these examples that a structured analysis of websites can reveal a lot about your competitors or potential customers. The focus of the evaluations here was more on technical indicators and superficial analyses. However, the possibilities are much more extensive and can provide valuable insights for your business.

Author, Legal And Privacy Policy

Michael Feichtinger. Developer and consultant. After 10 years of development work in various web agencies and almost 15 years at a large Austrian job portal, now self-employed and available for hire. I deal with all topics related to web development, technologies, and processes in development departments.

You can contact me on Twitter, LinkedIn, or by email.

Privacy Policy: No personal data is collected, processed, stored, or shared on this site. No cookies are set, and no tracking or other external dependencies are integrated.

Legal: Klosterstraße 3, 4020 Linz, Webdevelopment and Consulting, Member of WKÖ, Authority: Bezirkshauptmannschaft Linz, GISA: 35488286

Versions

  • 2023-05-01 - Version 1.0
  • 2023-05-10 - Version 1.1 - Section Gender-inclusive language added