Random Acts of Architecture

Wisdom for the IT professional, focusing on the chaos that is IT systems and architecture.

Twitter will never open source its algorithm

Elon Musk sparked controversy with his recent attempt to take over Twitter. Many support him, citing Twitter’s relatively poor revenue and Musk’s record of turning seemingly unprofitable ventures, like electric vehicles and space exploration, into successes.

However, his recent Twitter poll caught my interest: 82% of over one million respondents voted that Twitter should open source its algorithm.

Musk explained further during interviews at the TED 2022 conference. The “Twitter algorithm” refers to how tweets are selected and then ranked for different people. While some human intervention occurs, social networks like Twitter replace a human editorial team’s accountable moderation with automation. Humans cannot practically and economically manage and rank the estimated 500 million tweets sent per day.

By “open source”, Musk means “the code should be on GitHub so people can look through it”. Hosting software code on github.com is common practice for software products. Third parties can examine the code to ensure it does what it claims to. Some open-sourced products also allow contributions from others, leveraging the community’s expertise to collectively build better products.

Musk says “having a public platform that is maximally trusted and broadly inclusive is extremely important to the future of civilization.” People frequently demonize social networks for heavy-handed or lax “censorship”, depending on their side in a debate. Pundits claim social networks limit “free speech”, conveniently forgetting that free speech protections constrain governments, not private companies. Pundits cite examples of the algorithm prioritizing or deprioritizing tweets, authors or topics. They also cite account suspensions and cancellations, sometimes manual and sometimes automated.

Musk assumes that explaining this algorithm will increase trust in Twitter. He called Twitter “a public platform”, implying not just public access but collective ownership and responsibility. If people understand how tweets are included and prioritized, the focus can move from social networks to the conversations they host.

Unfortunately, understanding and trust are two different things. Well understood and transparent processes, like democracies’ elections or justice systems, are not universally trusted. No matter the intentions or execution of a system, some people will accuse it of bias. These accusations may be made in good faith out of ignorance, observe real but rare failures, or be malicious and subversive.

Twitter’s algorithm is not designed to give equal exposure to conflicting perspectives. It is designed primarily to maximize engagement and, therefore, revenue. It is not designed to be “fair”. Social networks are multibillion dollar companies that can profit from the increased exposure controversy brings. Politicians alienate few and resonate with many when they point the finger of blame at Twitter.

Designing an algorithm for fairness is practically impossible. You can test for statistical bias in a numeric sample set but not across the near entirety of human expression. Like the philosophers opposing the activation of Deep Thought in Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, debating connotation, implication and meaning across linguistic, moral, political and all other grounds is an almost endless task.
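
For contrast, bias in a narrow numeric sense is mechanically testable. A minimal Python sketch, with invented impression counts, applies a two-proportion z-test to ask whether tweets from two groups were surfaced at different rates:

    import math

    # Invented counts: tweets surfaced vs. tweets eligible, for two groups.
    shown_a, total_a = 4800, 10000
    shown_b, total_b = 4550, 10000

    p_a, p_b = shown_a / total_a, shown_b / total_b
    p_pool = (shown_a + shown_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se

    # |z| > 1.96 rejects "equal exposure" at the 5% significance level.
    print(f"z = {z:.2f}; statistically unequal exposure: {abs(z) > 1.96}")

No comparable test exists for fairness across meaning, context and culture.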

Even assuming transparency could ensure trust and fairness, open sourcing the Twitter algorithm assumes the algorithm is readable and understandable. The algorithm likely relies on complex, doctorate-level logic and mathematics. It likely includes machine learning, whose behavior is learned from training data rather than written as readable rules. It likely depends on custom databases and communication mechanisms, which may also have to be open sourced and explained.
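
To see why machine learning resists this kind of inspection, consider a deliberately simplified ranking model in Python; the features and weights are invented for illustration:

    # The code is trivially short; the behavior lives in the learned weights.
    # Reading the code reveals little about what gets ranked highly.
    def score(tweet_features, weights):
        # e.g. features: recency, author follower count, predicted like rate
        return sum(f * w for f, w in zip(tweet_features, weights))

    weights = [0.12, -0.03, 0.87]  # produced by training, not by a programmer
    print(score([0.5, 0.9, 0.4], weights))

Publishing the score function above explains nothing about why one tweet outranks another; the weights, and the data that produced them, carry the real logic.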

This complexity means few will be able to understand and evaluate the algorithm. Those who can may be accused of bias just like the algorithm. Some may have motivations beyond judging fairness. For example, someone may exploit a weakness in the algorithm to unfairly amplify or suppress a tweet, individual or perspective.

Musk’s plan assumes Twitter has a single algorithm and that algorithm takes a list of tweets and ranks them. It is more likely a combination of different algorithms. Some work when tweets are displayed. Others run earlier, for efficiency, when tweets are posted, liked or viewed. Different languages, countries or markets may have their own algorithms. To paraphrase J.R.R. Tolkien, there may not be one algorithm to rule them all.
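
A plausible structure, sketched below with wholly hypothetical helper functions rather than Twitter’s actual code, is a pipeline of candidate generation, filtering and ranking:

    # Hypothetical multi-stage timeline pipeline; the helpers are stand-ins.
    def candidates_for(user):
        # Stage 1: candidate generation, possibly precomputed when tweets
        # are posted, liked or viewed rather than at display time.
        return [("tweet from a follow", 0.4), ("recommended tweet", 0.7)]

    def passes_filters(user, tweet):
        return True  # stand-in for safety, spam and block rules

    def ranker_for(market):
        return lambda tweet: tweet[1]  # stand-in for a market-specific model

    def build_timeline(user, market):
        candidates = [t for t in candidates_for(user) if passes_filters(user, t)]
        return sorted(candidates, key=ranker_for(market), reverse=True)

    print(build_timeline(user="joe", market="AU"))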

Having multiple algorithms means each must be verified, usually independently. It multiplies the already large effort and problems of ensuring fairness.

Musk’s plan also assumes the algorithm changes infrequently. Once verified, it is trusted and Twitter can move on. However, experts continue to improve algorithms, making them more efficient or engaging. Hardware improves, providing more computation and storage. Legal and political landscapes shift. Significant events like elections, pandemics and wars force tweaks and corrections.

Not only do we need a group of trusted experts evaluating multiple complex algorithms, they need to do so repeatedly.

Ignoring potentially reduced revenue from algorithm changes, open sourcing Twitter’s algorithm also threatens Twitter’s competitive advantage. Anyone could take that algorithm and implement their own social network. Twitter has an established brand and user base in the West, but its market share is far from insurmountable.

There are other aspects to open sourcing. For example, if Twitter accepts third party code contributions, it must review and incorporate them. This could leverage a broader pool of contributors than Twitter’s employees but Twitter probably does not need the help. Silicon Valley tech companies attract good talent easily. Some contributions could contain subtle but intentional security flaws or weaknesses.

If the goal is to have a choice of algorithms, is this choice welcome or does it place more cognitive load on people just wanting a dopamine hit or information? TikTok succeeded by giving users zero choice, just a constant stream of engaging videos.

Evaluating an algorithm’s effectiveness is more than just understanding the code. It requires access to large volumes of test data, preferably actual historical tweets. Only Twitter has access to such data. Ignoring the difficulty of disseminating such a huge data set, releasing it all would violate privacy laws. Providing open access to historical blocked or personal tweets would also erode trust.

Elon Musk has demonstrated an uncanny ability to succeed at previously unprofitable enterprises like electric vehicles and space travel. Perhaps there is more to Musk’s Twitter plan than is apparent. Perhaps he is saying what he needs to say to ensure public support for his Twitter takeover.

While open sourcing Twitter’s algorithm appeals to the romantic notion that information is better free, increased transparency will not create a “maximally trusted and broadly inclusive” Twitter. Social networks like Twitter put almost unbelievable amounts of data into our hands nearly instantly. They have difficulty with contentious issues and, therefore, trust because they reflect existing contention back at us. It is easier to blame the mirror than ourselves.

The image is a cropped version of https://commons.wikimedia.org/wiki/File:Programming_code.jpg. Used under Creative Commons Attribution-Share Alike 4.0 International license.

Privacy

Privacy is one of those oft-misused terms people throw around, particularly in discussions of the SOPA or PROTECT-IP acts in the US. It is an ethical question and, unfortunately, arguments on either side usually degenerate into straw man arguments about “big brother” versus “pirates, criminals or terrorists”.

Taking a step back, privacy can mean one of three things in the context of information technology. First, privacy can mean anonymity, where people can contribute to discussions or other activities without having those contributions attributed to them directly. Outside large organizations, this is usually accomplished by adopting a new identity, such as a forum user or an online game character. Inside organizations, anonymity is rare. Authority and accountability require real names, and identity can be centrally managed even if single sign-on or federated authentication is still rare.

Many arguments over anonymity descend into questions of what information can individually identify people, called Personally Identifiable Information (PII). For example, is the IP address you use to access the Internet PII? If you are the only person accessing the Internet from that IP address and you use it for a long time, it may be. However, if you are behind a NAT, firewall or similar measure, it may not be.

The problem is these discussions often consider potential PII in isolation. For example, my company regularly runs employee surveys. As the only Australian employee in my business group, if I select my country or office, I immediately lose anonymity. Few would argue that someone’s country alone is PII, but combined with other fields it can identify them. Add easy inference and ready access to analytics and the situation becomes even more complicated.
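
One way to catch this, sketched below in Python with invented survey rows, is a k-anonymity check: flag any combination of answers shared by fewer than k respondents:

    from collections import Counter

    # Invented survey rows: ostensibly anonymous, but rare combinations
    # of answers can still identify individuals.
    responses = [
        {"country": "US", "group": "Engineering"},
        {"country": "US", "group": "Engineering"},
        {"country": "AU", "group": "Engineering"},  # a unique combination
    ]

    k = 2  # every combination should cover at least k respondents
    combos = Counter((r["country"], r["group"]) for r in responses)
    for combo, count in combos.items():
        if count < k:
            print(f"{combo} identifies {count} respondent(s); suppress or generalize")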

Second, privacy can mean confidentiality, where people want to restrict access to information. This is usually enforced by access control (e.g. file permissions) or encryption (controlling access by protecting and distributing keys). A common example is a person’s medical records being available to medical professionals treating them but not to others. These records may be available to a wider audience of medical professionals for research or statistics as long as PII is removed.
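
A toy sketch of that arrangement, with invented roles and fields: treating clinicians see the full record, researchers see it only with PII stripped, and everyone else is refused:

    # Toy access control for the medical-records example.
    RECORD = {"patient": "Jane Doe", "dob": "1980-01-01", "diagnosis": "flu"}
    PII_FIELDS = ("patient", "dob")

    def view(record, role):
        if role == "treating_clinician":
            return record
        if role == "researcher":
            return {k: v for k, v in record.items() if k not in PII_FIELDS}
        raise PermissionError("no access to medical records")

    print(view(RECORD, "researcher"))  # {'diagnosis': 'flu'}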

However, confidentiality alone is not sufficient for privacy. Continuing the medical example, just because you want your doctor to see your medical records does not mean you want him or her to send them to a local newspaper to print potentially embarrassing stories about you. The doctor is permitted to use the records only for treating you. Third, then, privacy can mean restricting how information is used, usually defined via laws or consent from the subject (the person the information describes or identifies).

Privacy in all three forms is clearly important for software dealing with external customers, particularly in heavily legislated areas such as the medical or financial industries. Information on these could fill novels, is usually jurisdiction-specific and is better covered elsewhere.

Some would argue enterprise software targeted at employees is less concerned with privacy. Most organizations’ policies state that employees using organization-provided computers or systems submit to malware scanning, logging of actions, indexing and retention for later retrieval, and so on; complying with these policies is usually a condition of employment. Some countries’ governments also require access to otherwise confidential information, as with the recent issues with BlackBerry devices being too secure.

However, this is not the case everywhere. Many jurisdictions, Europe in particular, have strong privacy laws. These benefit the subject by restricting the collected information’s use to that consented to at the time of collection. However, well-meaning privacy legislation can impact IT in unintended ways. For example, if an application logs the path to a file “c:\users\joe\my documents\doc.txt” (Windows) or “/home/joe/Documents/doc.txt” (Mac, *nix) for usage statistics or supportability, has it inadvertently captured the user’s name (clearly PII) in the path, and should the application remove or obfuscate that directory?
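
One defensive answer is to scrub the user-name directory before logging. A Python sketch follows; the patterns are illustrative and cover only common Windows and Unix home-directory layouts:

    import re

    # Replace the user-name directory in common home-directory layouts.
    def redact_user(path):
        path = re.sub(r"(?i)^([a-z]:\\users\\)[^\\]+", r"\1<user>", path)
        path = re.sub(r"^(/(?:home|Users)/)[^/]+", r"\1<user>", path)
        return path

    print(redact_user(r"c:\users\joe\my documents\doc.txt"))
    print(redact_user("/home/joe/Documents/doc.txt"))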

Many countries also limit the movement of PII across their borders. Consent can permit it but the consent must be specific and given in advance. This creates challenges when aggregating data across an enterprise, building ad hoc reporting or sharing data between systems. It is particularly challenging with cloud-based systems, where the location of data is unclear or data is replicated across multiple locations for redundancy.
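
A sketch of the kind of guard this forces, with invented record fields: before replicating PII to another region, check that the consent captured at collection time names the destination:

    # Invented record: replication is allowed only to regions the subject
    # consented to when the data was collected.
    record = {"subject": "joe", "contains_pii": True, "consented_regions": ["EU"]}

    def can_replicate(record, destination_region):
        if not record["contains_pii"]:
            return True
        return destination_region in record["consented_regions"]

    print(can_replicate(record, "EU"))  # True
    print(can_replicate(record, "US"))  # False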

The target market of a software architect’s product influences or dictates its privacy needs. Indeed, as the individual responsible for non-functional requirements, software architects should understand which form(s) of privacy apply and to whom. They need not be experts – that is what legal departments are for – but knowing what to work around and what to work with is important, particularly for bigger sales. If SOPA or PROTECT-IP is passed, software architects may have even more to learn and apply.

