Latent Auto-recursive Composition Engine

A Generative System For Creative Expression in Human-AI Collaboration

This thesis investigates the shifting boundaries of art in the era of Generative AI, critically examining the essence of art and the legitimacy of AI-generated works. Despite significant advancements in the quality and accessibility of art through generative AI, such creations frequently encounter skepticism regarding their status as authentic art. To address this skepticism, the study explores the role of creative agency in various generative AI workflows and introduces a "human-in-the-loop" system tailored for image generation models like Stable Diffusion.

The Latent Auto-recursive Composition Engine (LACE) aims to deepen the artist's engagement and understanding of the creative process. LACE integrates Photoshop and ControlNet with Stable Diffusion to improve transparency and control. This approach not only broadens the scope of computational creativity but also enhances artists' ownership of AI-generated art, bridging the divide between AI-driven and traditional human artistry in the digital landscape.

 

Background

The Essence of Art

The ability of generative AI to create art has sparked scholarly debate for decades. While visually impressive, AI-generated art often lacks contextual meaning and innovative structure. The art community frequently views these creations as mere imitations of human effort, learned statistically from existing works. Assessing AI-generated art is challenging due to the subjective nature of art evaluation and the absence of universal standards for artistic quality.

To advance this discussion, we must clearly define what constitutes art. This definition will help us develop a framework to assess the artistic merit of computer-generated works and explore the complexities of computational creativity.


What Makes Something Art?

Discussions surrounding the definition of art have produced many theoretical frameworks, yet no theory has successfully encompassed all aspects of art. Philosophers such as Morris Weitz have even challenged the pursuit of defining art's essence, arguing that art is an inherently "open concept."

However, adopting a broader definition may be beneficial. George Dickie's perspective in "Defining Art" offers a foundational approach to identifying what qualifies as a work of art. Dickie differentiates between the generic concept of "art" and specific sub-concepts like novels, tragedies, or paintings. He suggests that while these sub-concepts may not have the necessary and sufficient conditions for definition, the overarching category of "art" can be defined.

According to Dickie, two key elements are essential: a) artifactuality—being an artifact created by humans; and b) the conferring of status—where a society or subgroup thereof has recognized the item as a candidate for appreciation.

This framework seeks to avoid the pitfalls of traditional art definitions, which often implicitly include notions of "good art" that are overly restrictive or depend on metaphysical assumptions. Instead, his definition aims to reflect the actual social practices within the art world.

Building on this legacy, this thesis will explore artifactuality through computational methods in the context of generative AI, focusing on one of the foundational elements that might define what can be considered art in the digital age.


Can Humans Distinguish Between Human and Machine-Made Art?

Addressing George Dickie's concept of the "conferring of status," a pivotal inquiry arises: can people discern between artworks created by humans and those generated by machines, especially when the quality of machine-made art rivals that of human creation?

A 2024 study by Kazimierz Rajnerowicz reveals a growing difficulty in distinguishing between AI-generated and human-created images, with up to 87% of participants unable to make accurate identifications. This difficulty persists even among those with AI knowledge. Rajnerowicz's article examines how individuals judge the authenticity of images and the potential risks of failing to recognize AI-generated content, and underscores the necessity of understanding AI advancements to prevent deception by deepfakes and other sophisticated AI techniques.

Lucas Bellaiche et al. delve into how people perceive and derive contextual meaning from human versus AI-generated art. Their study indicates that people tend to perceive art as reflecting a human-specific experience, though creator labels seem to mediate the ability to derive deeper evaluations from art. Thus, creative products like art may be achieved—according to human raters—by non-human AI models, but only to a limited extent that still protects a valued anthropocentrism.

However, a study by Demmer and colleagues titled "Does an emotional connection to art really require a human artist?" uncovered compelling evidence indicating that participants experienced emotions and attributed intentions to artworks, independent of whether they believed the pieces were created by humans or computers. This finding challenges the assumption that AI-generated art is incapable of evoking emotional and intentional human elements, as participants consistently reported emotional responses even towards computer-generated images.

Nonetheless, the origin of the artwork did have an impact, with creations by human artists eliciting stronger reactions and viewers often recognizing the intended emotions by the human artists, suggesting a nuanced perception influenced by the actual provenance of the art.


Why do people think AI-generated artwork is “artificial”?

Generative AI empowers artists to manipulate the latent space with ease, creating new artworks through simple prompt modifications. Advanced techniques and tools such as LoRA, ControlNet, and image inpainting provide even greater control over the generative output. Despite these capabilities, there is a prevalent bias among observers who view computer-generated art as "artificial." This perception stems from several factors:

  • Lack of Human Touch: AI-generated art lacks a direct human creative process, leading to views of it being less genuine or lacking soul.

  • Reproducibility: The ability of AI to rapidly produce multiple, similar outputs may reduce the perceived value and uniqueness of each artwork.

  • Transparency and Understanding: The opaque decision-making process of AI systems often results in doubts about the creativity involved.

  • Missing Context: AI does not fully understand or express the social, cultural, or political nuances that deepen traditional art, often making its products seem technically proficient yet shallow.


Creative Process and Iterative Intent

In his 1964 work, "The Artworld," Arthur Danto emphasizes that art relies on an "artworld" consisting of theories, history, and conventions that recognize it as art rather than mere objects. This notion highlights a pivotal idea: in modern art, the creative process might be more important than the actual artwork. Without the artworld's narratives and contexts, the audience may struggle to grasp the artwork, as it is not guaranteed to deliver the same experience, potentially widening the gap between art perception and concept. This gap further expands in AI-generated art due to the opaque nature of AI models, which obscure the creative process and question the legitimacy of the artwork.

Artists often find themselves unable to articulate the relationship between their input (text prompts) and the machine-generated output, reducing their sense of ownership over the work. Despite potentially high-quality results, artists might not view these outputs as their own creations, leading to a perception of AI models as the true authors.

Moreover, the iterative nature of creative intent presents further challenges. Artists typically do not start with a clear vision; instead, they develop and refine their goals through the creative process, an approach fundamental in fields such as design, architecture, or illustration, where concepts often evolve through iterative experimentation. For instance, in architectural design, a technique known as "Generative drawing" plays a critical role. Described in "Generative Processes: Thick Drawing" by Karl Wallick, this method involves using drawings not just as tools for documentation but as active participants in the design process. These drawings help conceptualize ideas while integrating both abstract thought and practical execution into a single visual narrative, maintaining visibility of the design process to enhance creative exploration.

This contrasts sharply with the requirements of most generative models, which necessitate a well-defined intent from users, usually articulated through precise text prompts. This rigid structure creates a significant disconnect: if an artist’s intent shifts during the creative process—a common occurrence—the output from the generative model may no longer align with their evolving vision, rendering the quality of the result irrelevant.


Charting artistic authenticity: A new metric for assessing the authenticity of artwork by mapping agency on an additional axis. Here, the spectrum of agency spans from fully autonomous AI-generated art to human-driven creation, offering a nuanced view of artistic origination.

Sense of Agency

Agency in generative artwork refers to the capacity of the creator to make independent decisions that significantly affect the outcome of the art. In traditional art, agency is clear-cut; artists consciously choose every detail of their work, from the medium to the message. However, in AI-generated art, the concept of agency is more nuanced. The human operator provides text prompts or images, and the AI then processes these inputs based on its training data. Many artists contend that presenting the raw output of AI as one's own work amounts to theft and a lack of originality, since these outputs are built on other artists' styles, often used without consent and with little regard for copyright or creative credit. Therefore, understanding the complexities of agency is crucial in human-AI collaborations, as it affects the quality and ownership of the work and its impact on computational creativity.

The Importance of Agency

In HCI, studies on the Sense of Agency (SoA), like Wegner et al.'s "Vicarious agency: experiencing control over the movements of others," commonly measure one's perception of cause and effect over an event or object, and thereby determine whether the person feels ownership over it. However, these studies primarily focus on the augmentation of body, action, or outcome rather than internal states such as machine-aided creativity. Yet in art, particularly AI-enhanced creation, the significance of agency extends to a proposed axis that gauges the artwork's authenticity.

AI art creation traditionally orbits two axes: the architecture grounding the model in training data, and the latent navigator that seeks the desired image from input prompts. This process epitomizes the creative journey in AI art, where the user steers the pre-trained model with tools to match their artistic vision.

Introducing the agency axis reframes the artistic origin narrative. At one extreme, complete human intervention equates to a work that is entirely the artist's intellectual property, where every creative facet is handpicked. On the other hand, excessive reliance on AI for composition and style risks depersonalizing the art, prompting the art community's devaluation of such works.

An optimal creative balance is struck when both human and AI inputs interact, fostering a space where creative spontaneity meets deliberate artistry, and ownership over the end product is clear. This intersection becomes a breeding ground for creative discovery, marrying human intention with AI's potential.

Intentional Binding and Human-AI Interaction

A fundamental aspect of human cognition that illuminates our interaction with AI in creative processes is intentional binding. This phenomenon, where individuals perceive a shorter time interval between a voluntary action and its sensory consequence, highlights how the perception of agency influences our engagement with the world. In the context of HCI, especially in AI-enhanced art, understanding intentional binding provides valuable insights into how artists perceive and integrate AI responses into their creative expression.

When artists interact with AI, the immediacy and relevance of the AI's output to the artist's input can affect their sense of control and creative ownership. If the output closely and quickly matches the artist's intention, similar to the effects observed in intentional binding, the artist may experience a greater sense of agency. This heightened perception of control can make AI tools feel more like an extension of the artist’s own creative mind rather than an external agent imposing its own logic.

Therefore, in discussions about agency in AI-generated art, it is crucial to consider how the principles of intentional binding might play a role in shaping the artist's experience of the creative process. This understanding can guide the development of more intuitive AI systems that enhance the artist's agency, promoting a more seamless and satisfying creative partnership.

The Technology of Image Synthesis

The art community has experienced profound transformations over the years, significantly influenced by advancements in artificial intelligence and machine learning. The evolution of generative models has been pivotal in shaping both the capabilities and applications of generative art. In this section, we will explore the history of technology in art.

The Evolution in Generative Models

The development of generative models in image synthesis and artistic creation has seen remarkable transformations since the mid-20th century. Initially, in the 1950s and 1960s, pioneering artists utilized oscilloscopes and analog machines to produce visual art directly derived from mathematical formulas. With the advent of more widely accessible computing technologies in the 1970s and 1980s, artists such as Harold Cohen began to employ these tools to create algorithmic art. Cohen's work with AARON, for instance, involved using programmed instructions to dictate the form and structure of artistic outputs. During the 1980s, the emergence of fractal art marked a significant advancement, employing mathematical visualizations to craft complex and detailed patterns. The 1990s introduced evolutionary art and interactive genetic algorithms, which further democratized the creative process by allowing both artists and viewers to participate in the evolution of artworks.

The resurgence of neural networks in the 2000s, fueled by advancements in GPU computing power and the availability of large datasets, significantly propelled technological innovation. Prominent developments such as Convolutional Neural Networks (CNNs), Variational Autoencoders (VAEs), and Generative Adversarial Networks (GANs) transformed the landscape of generative art. These models revolutionized the field by generating highly realistic images that closely mimic the characteristics of their training data.

Recent advancements, such as Vision Transformers (ViTs), Latent Diffusion Models (LDMs), and Mixture of Experts (MoE) models like RAPHAEL, have pushed the boundaries by enabling the synthesis of complex images from detailed text prompts, facilitated by Contrastive Language-Image Pretraining (CLIP). These innovations merge artistic expression with cutting-edge technology and challenge traditional notions of creativity and the artist's role.

These developments highlight a progression from relatively simple predictive models to sophisticated systems capable of understanding and generating complex visual content, demonstrating that models can encode art concepts into embeddings and later reconstruct or synthesize them for new artistic purposes.

The Limitations of Text-to-Image Models

Text prompts serve as a versatile and universally accessible method to guide the generation of images. As large language models evolve, text-to-image models are increasingly capable of interpreting both literal and semantic meanings of text prompts. Nevertheless, these models often face challenges in accurately rendering complex or abstract concepts based solely on textual descriptions. This misalignment between the generated content and its intended semantics presents several issues that need addressing.

Misalignment of Text Encoding

Wu et al. (2023) explore the disentanglement capabilities of stable diffusion models, demonstrating that these models can effectively differentiate between various image attributes. This disentanglement is facilitated by adjusting input text embeddings from neutral to style-specific descriptions during the later stages of the denoising process.

For instance, as illustrated in Figure 1 (Wu et al. 2022), the prompt "A photo of a woman" might yield significantly different results from "A photo of a woman with a smile." Although modifying text embeddings can help segregate different attributes, this approach struggles with fine, localized edits and may be overwhelmed by overly detailed neutral descriptions.
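A minimal sketch of this embedding-switch idea, assuming Hugging Face diffusers with a Stable Diffusion 1.5 checkpoint; it illustrates the technique Wu et al. describe rather than reproducing their released code, and the prompts, step count, and switch point are hypothetical:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def encode(prompt):
    # Encode a prompt into CLIP text embeddings.
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids.to(pipe.device))[0]

neutral = encode("A photo of a woman")               # neutral description
styled = encode("A photo of a woman with a smile")   # attribute-specific description
uncond = encode("")

steps, switch_at, guidance = 30, 18, 7.5             # hypothetical values
pipe.scheduler.set_timesteps(steps)
latents = torch.randn((1, pipe.unet.config.in_channels, 64, 64),
                      device=pipe.device, dtype=torch.float16)
latents = latents * pipe.scheduler.init_noise_sigma

for i, t in enumerate(pipe.scheduler.timesteps):
    # Early steps fix the overall layout with the neutral embedding;
    # later steps swap in the attribute-specific embedding.
    cond = neutral if i < switch_at else styled
    latent_in = pipe.scheduler.scale_model_input(torch.cat([latents] * 2), t)
    noise = pipe.unet(latent_in, t,
                      encoder_hidden_states=torch.cat([uncond, cond])).sample
    n_uncond, n_cond = noise.chunk(2)
    noise = n_uncond + guidance * (n_cond - n_uncond)
    latents = pipe.scheduler.step(noise, t, latents).prev_sample

image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample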

Moreover, dependency on prompt engineering often encounters inherent limitations as the model's semantic interpretation can significantly diverge from human understanding. Such dependence is typically restricted by the labels in the training dataset. For instance, using specialized terminology like "monogram" to describe a straightforward "black and white" graphic may lead to unforeseen results, largely due to the annotators' limited domain-specific knowledge.

Despite potential improvements in semantic understanding and better alignment of text embeddings with image and art concepts as large image generation models become more complex and larger, users may still face difficulties in articulating abstract concepts through text. The main challenge arises from the discrepancy between how datasets are annotated and how users describe their desired outcomes using the same vocabulary.

 
 

Download the full research thesis in PDF

GitHub repo: https://github.com/iamkaikai/LACE

kai huang
Finger Reading (手指識字)

Source of the research paper:

https://sclee.website/wp-content/uploads/2020/04/finger-reading-physical2.pdf

kai huang
The Next Big Thing

I visited the Apple store today to experience the Apple Vision Pro demo, which could very well be the next iPhone moment.

Here are my thoughts:

In summary, Apple Vision Pro represents a significant leap in XR technology, combining high-fidelity visuals with an innovative interaction system. It's not just a new device but a whole new platform that could redefine how we interact with digital content, showcasing Apple's ability to innovate and lead in new technology domains.

The Pro's Micro-OLED display is absolutely top-notch; combined with a 90 Hz refresh rate, the realism is almost enough to deceive the brain. Watching videos and 3D movies feels genuinely different from previous devices, suggesting Apple believes only ultra-realistic visuals can make XR applications viable. Probably only Apple would spend this much and still convince customers to pay for it.

The Passthrough experience powered by the R1 chip significantly outperforms competitors; while there's still some noise in the visuals, it's leagues ahead of the Quest3, noticeable but good enough to overlook. No doubt hardware performance will continue to improve, and that is Apple's moat; developing a dedicated chip is truly remarkable.

Given the visuals can compare with reality, the focus naturally shifts to photo and video. Spatial Video truly immerses you into wonder, just short of tactile and olfactory experiences. It's hard to explain without firsthand experience, like explaining the internet to someone 100 years ago. One demo was watching a soccer game as if you were sitting above the goal, watching the goalkeeper defend. The narrator mentioned some things money can't buy (but buying the headset can). Thinking about it, if attending the World Cup front row costs a fortune, then $3500 for multiple games in the best seat seems reasonable. This has huge potential in entertainment and film, possibly changing how we sell tickets to sports events. This might also explain why Apple insists on having its streaming platform, Apple TV+.

Apple introduced a whole new interaction system for Vision Pro, arguably the most systematic and suitable for extended use to date. Simply put, Apple has separated the actions of scrolling and clicking into eye tracking and hand gestures. Since the interaction space is 3D, traditional 2D screen interactions don't apply. To overcome the freedom on the Z-axis, Apple ingeniously replaced the mouse with eye focus, meaning you don't have to keep lifting your hands (though you can if you want). However, since the eyes rapidly jitter when focused on something, hand gestures are used to confirm actions and control inputs like clicking, sliding, and pinching, with more gestures likely to be developed. [1]

Based on this, users can envision resting their hands on a desk, lap, or chair to operate the system. For precise input, keyboards, mice, drawing pads, and customized controls represent future possibilities. This is a significant shift from the "raising hands" mode since it's impractical to do so for 8 hours. Users can operate a 3D system without lifting their hands, freeing them for other tasks like typing on a computer, eating, cooking, or exercising. You could even operate another computer system simultaneously without being tied to the Vision Pro.

From a development perspective, Apple has focused on developing the underlying system and establishing fundamental interaction principles, using systematic standards to build a new XR ecosystem similar to the App Store ecosystem. A good example is the learning curve: for an elderly person who has never used VR, learning to operate a handheld controller is quite steep, but gesture interactions built on MacOS and iOS conventions are relatively easy to grasp. This is why all demos showcased basic functionalities, gesture operations, web browsing, videos, and apps, rather than flashy 3D games. Currently, the Oculus ecosystem operates in a disjointed manner, with Meta, Unity, and content creators each doing their own thing, leading to incompatible SDKs and poorly executed documentation.

In terms of weight, it's slightly heavier than Quest3, perhaps because I added lenses. But honestly, Quest3 isn't much lighter. Also, Vision Pro gets warm after prolonged use, making the face cushion feel hot, but these are solvable issues. For example, you could lie down to watch movies or work, or change the headband or pad.

Price-wise, many initially thought $3500 was too steep, and rumors suggested manufacturing yields were low, so stock might be limited. But at the store, scalpers were buying in bulk, and staff mentioned a wide age range of buyers, from 60 to teenagers. Several employees even purchased it themselves, sharing they were initially skeptical but were ultimately impressed. When asked about the comparison of one Vision Pro to seven Quest3s, they said seven Toyotas can't compare to one Aston Martin. Apple's brand premium is indeed high.

In summary, most criticisms of Vision Pro revolve around its price. Conversely, even if Meta spent the same amount on hardware, integrating software to match Apple's experience might not be possible. Apple has transferred its experience in phones, computers, systems, apps, headphones, and AI to Vision Pro. I anticipate a surge in XR designer and developer jobs, marking a new era of opportunities.

The question arises: Why, after a decade of Oculus and significant investment by META, did they not opt for high-end pricing and significant breakthroughs in interaction solutions, instead seemingly focusing on hardware upgrades? Apple, on the other hand, has forged its own path. I believe that even after Facebook rebranded to META, the product vision remained unclear, lacking effective integration of software and hardware. In contrast, Apple's vision has shifted the discussion from a VR Metaverse to a reality-integrated XR system. The question now is, what does the future computer system look like in a 3D space? How to build this platform and enable developers to enrich this ecosystem?



kai huang
Ink Drop

This is a collaboration with META’s internal team and Dartmouth ILIXR to develop an AI-powered drawing tool leveraging Stable Diffusion technology. The focus is on the human-in-the-loop design experience and AI integration.

In our study, we concentrate on methods that manipulate the denoising diffusion process, aiming to expand the scope of control over the generated outcomes and image editing [1]. Crucially, our objective is to involve artists directly in the process, thereby transforming it into a more transparent and interactive 'white box' system.

A screenshot of the system connected to Photoshop

 

Guided Image Synthesis
Using Diffusion-Based Editing Techniques

The fundamental idea is to modify the diffusion process in Stable Diffusion to make it transparent and controllable. This is done by visually guiding the process and making it modular. By controlling the scheduler and specific points in the diffusion timeline, artists can achieve more precise results. Furthermore, incorporating drawing software like Photoshop for editing inputs turns it into a memory unit, preserving the latent states in Photoshop's editing history. This method mimics a Recurrent Neural Network (RNN) by storing the latent states and using the output from one process as the input for the next, leading to highly consistent outcomes.


Visualization of the Denoising Scheduler

In the project, we use ComfyUI for prototyping. The KSampler (Advanced) node adds noise to the latent and then denoises it, but how much of that schedule is executed is controlled by the start_at_step and end_at_step settings. This makes it possible, for example, to hand over a partially denoised latent to a separate KSampler (Advanced) node to finish the process.


denoise = (steps - start_at_step) / steps

source: https://stable-diffusion-art.com/how-stable-diffusion-work/

The amount of noise reduced at different start_at_step
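As a quick sanity check of the relationship above (the numbers are illustrative and assume a 20-step schedule), the effective denoise strength can be tabulated directly:

def denoise_fraction(steps: int, start_at_step: int) -> float:
    # Fraction of the noise schedule actually executed by the sampler.
    return (steps - start_at_step) / steps

steps = 20
for start in (0, 14, 16, 18, 20):
    print(f"start_at_step={start:2d} -> denoise={denoise_fraction(steps, start):.2f}")
# start_at_step= 0 -> denoise=1.00  (full text-to-image generation)
# start_at_step=14 -> denoise=0.30  (only the last 6 steps run, preserving the input)
# start_at_step=20 -> denoise=0.00  (the latent passes through untouched)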

Extensive research, including the findings mentioned in [7][10], has analyzed the influence of noise schedules on model performance. Such factors may similarly alter the backward diffusion process, an area we will examine thoroughly in subsequent sections of this document. The subsequent diagram offers a straightforward depiction of how noise schedules affect outcomes. Interestingly, the team at ByteDance uncovered a bug in the commonly used sampling method from the original stable diffusion team, which produces incorrect samples in response to explicit and straightforward prompts [9].

 

Ziyi Chang, George Koulieris, and Hubert P. H. Shum, On the Design Fundamentals of Diffusion Models: A Survey [7]

By employing different denoising schedulers,
we notice varied outcomes using an identical sampling method and seed.

Comparison of “start_at_step” parameters, varying from 14 to 20, arranged left to right, using SD 1.5 with LoRA.


Visualization of the Denoising Process

The schedule's step count is crucial. Allowing the sampler more steps enhances accuracy. In this context, we construct a custom node to terminate the denoising diffusion process at a predetermined step.
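One simple way to get this behavior is sketched below: a small ComfyUI custom node that emits step bounds for a downstream KSampler (Advanced) node to consume through its start_at_step and end_at_step inputs, leaving the latent only partially denoised. This is a hedged illustration, not the project's actual node; only the standard INPUT_TYPES/RETURN_TYPES node interface is assumed.

class StopAtStep:
    # Illustrative ComfyUI custom node: converts a "stop here" step into the
    # start/end bounds expected by KSampler (Advanced).
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "steps": ("INT", {"default": 20, "min": 1, "max": 1000}),
            "stop_at_step": ("INT", {"default": 14, "min": 0, "max": 1000}),
        }}

    RETURN_TYPES = ("INT", "INT")
    RETURN_NAMES = ("start_at_step", "end_at_step")
    FUNCTION = "compute"
    CATEGORY = "latent/custom"

    def compute(self, steps, stop_at_step):
        # Denoise only the range [0, stop_at_step); wire the outputs into a
        # KSampler (Advanced) node whose step widgets are converted to inputs.
        return (0, min(stop_at_step, steps))

NODE_CLASS_MAPPINGS = {"StopAtStep": StopAtStep}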



Modularizing the diffusion process

By controlling the scheduler and the diffusion process, we observe the trade-off between creative deviation and input consistency in the two plots below. [3]

 
 

Having the control of the scheduler and denoising steps, we can use the latent result from the first diffusion process as the input for the second diffusion process.

Redirecting the generation process by giving different negative prompts
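A hedged sketch of this handoff with diffusers (an illustration of the idea rather than the project's ComfyUI graph): the first pass runs part of the schedule under one conditioning, and its latent is finished by a second pass whose negative prompt redirects the result. The prompts, step split, and checkpoint are assumptions.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def encode(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids.to(pipe.device))[0]

def denoise_range(latents, cond, uncond, timesteps, guidance=7.5):
    # Run the denoising loop over a slice of the scheduler's timesteps.
    for t in timesteps:
        latent_in = pipe.scheduler.scale_model_input(torch.cat([latents] * 2), t)
        pred = pipe.unet(latent_in, t,
                         encoder_hidden_states=torch.cat([uncond, cond])).sample
        p_uncond, p_cond = pred.chunk(2)
        pred = p_uncond + guidance * (p_cond - p_uncond)
        latents = pipe.scheduler.step(pred, t, latents).prev_sample
    return latents

pipe.scheduler.set_timesteps(20)
ts = pipe.scheduler.timesteps
latents = torch.randn((1, 4, 64, 64), device=pipe.device, dtype=torch.float16)
latents = latents * pipe.scheduler.init_noise_sigma

# First pass: steps 0-13 with an empty negative prompt. Second pass: the
# partially denoised latent is finished over steps 14-19 under a new negative prompt.
prompt = encode("a castle on a cliff")
latents = denoise_range(latents, prompt, encode(""), ts[:14])
latents = denoise_range(latents, prompt, encode("fog, haze, blur"), ts[14:])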

Merging Concepts from Different LoRAs

Connecting the pipeline in Photoshop

We can utilize Photoshop to modularize the backward diffusion process as a memory block to track positions in latent space. Additionally, leveraging the layer mechanism can yield more consistent results from the model across various inputs.

Synthesizing images from strokes with SDEdit. The blue dots illustrate the editing process of our method. The green and blue contour plots represent the distributions of images and stroke paintings, respectively. Given a stroke painting, we first perturb it with Gaussian noise and progressively remove the noise by simulating the reverse SDE. This process gradually projects an unrealistic stroke painting to the manifold of natural images.

Essentially, employing Photoshop as a memory block akin to an RNN facilitates the exploration of creative possibilities within the image space.

Reference

[1] Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. arXiv preprint arXiv:2302.05543v3. https://doi.org/10.48550/arXiv.2302.05543

[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[3] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.

[4] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

[5] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.

[6] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (ECCV), pages 89–106. Springer, 2022.

[7] Ziyi Chang, George Koulieris, and Hubert P. H. Shum. On the Design Fundamentals of Diffusion Models: A Survey.

[8] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., & Ermon, S. (2021). SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073. https://doi.org/10.48550/arXiv.2108.01073

[9] Lin, S., Liu, B., Li, J., & Yang, X. (2023). Common diffusion noise schedules and sample steps are flawed. arXiv preprint arXiv:2305.08891. https://arxiv.org/abs/2305.08891

[10] Deja, Kamil et al. “On Analyzing Generative and Denoising Capabilities of Diffusion-based Deep Generative Models.” ArXiv abs/2206.00070 (2022): n. pag.

kai huang
MonsterGAN

Abstract

With ever more powerful deep learning algorithms, computer graphics have been pushed to a new level. The generative adversarial network (GAN) can now generate almost any type of photo-realistic image given a dataset of the proper size. However, most GAN use cases have been limited to the pursuit of lifelike graphics. In this article, I propose a new framework, “MonsterGAN,” combining machine learning, design, and psychology. MonsterGAN is a prototype of a generative design system (DRCI) for concept artists, which reduces the cognitive burden of creation and makes creativity scalable.

What happens if computer vision passes the Turing test? Where and how can we use it?
As a designer, I’m fascinated by these questions because we designers are the graphic wizards who deal with creativity and graphics daily. One bold idea comes to my mind: can we have machine conceptualized creativity?

In 1951, Fitts introduced the famous concept of function allocation and the MABA-MABA (Men-Are-Better-At / Machines-Are-Better-At) list, which explored how man and machine can work together as a team. In terms of the SRK taxonomy (skill-, rule-, and knowledge-based tasks), computers' capabilities were long limited to skill-based and rule-based tasks. With deep learning, I believe the level of automation has changed and machines can now take on specific knowledge-based jobs, which makes it necessary to rethink the notion of function allocation.

Machines are good at solving well-defined problems with their strengths: speed, precision, variation, scaling, and sensing. Humans, on the other hand, are good at ill-defined jobs, with strengths in design, empathy, and generalization. Working as a team, we can change the order of the process, such as man-machine-man, machine-man-machine, or machine-machine-man. I therefore propose a new MABA-MABA list for modern challenges, which are mostly knowledge-based: we can have machines do the first 70% of a job and have humans pick up the last 30%.

Methodology: New Creative-thinking workflow

In the early stage of creative thinking, the target is usually uncertain. This makes it impossible for machines to implement creative thinking. But what if we can dismantle each step of creative thinking and allocate the tasks according to the new MABA-MABA list?

As we know, there are four stages in the thinking process:


Preparation

The preparation step consists of observing, listening, asking, reading, collecting, comparing, contrasting, analyzing, and relating all kinds of objects and information.

Incubation

The incubation process is both conscious and unconscious. This step involves thinking about parts and relationships, reasoning, and often a fallow period.

Illumination

Inspiration very often appears during this fallow period [of incubation]. This probably accounts for the emphasis on releasing tension in order to be creative.

Verification

The step labeled verification is a period of hard work. This is the process of converting an idea into an object or into an articulated form.

I think that there is an opportunity to have machines that assist humans in the first two steps of the creative thinking process: preparation and incubation. As we know, uncertainty in a creative project usually stops people from delivering results on time because there are too many possibilities, and people tend to change their minds at the last second. What if we can build a generative design system for problem-solving processes of abduction and induction? This can help us decrease the time we spend on preparation and incubation; therefore, it “accelerates” the “Aha” moment.


Machine Divergence and human Convergence

In a traditional design thinking process, people repeat the cycle of divergence and convergence until they come up with a solution. It is how people narrow down the direction and iterate on the practice. However, the problem with repetitive creative labor is that humans burn out, limiting the possibility of scaling up creativity. With the new MABA-MABA list, we can have machines diverge and humans converge. If we can somehow encode ideas into numerical vectors (this is exactly what deep learning is good at), it is reasonable to have machines diverge, because computers can operate on vectors easily; this also decreases the cognitive workload, helping humans work faster.


The importance of ambiguity

We know that Generative Adversarial Networks are really difficult to train. Both the quality and quantity of the data need to be high.
However, in reality, creative graphics data is usually not sufficient. This creates a dilemma because the output of the GAN flattens out at an unacceptable level of quality. Fortunately, we can bypass the problem by asking a GAN model to generate ambiguous images. Therefore, we do not expect a model that generates “photo-realistic” results. Instead, a “good enough” model will be sufficient for humans to pick up.

Why do we want ambiguous results? It turns out that ambiguity plays a vital role in creative thinking (Tamara Carleton, William Cockayne, and Larry Leifer, 2008, An Exploratory Study about the Role of Ambiguity During Complex Problem Solving). This resolves the problem of low-quality results with limited datasets because we need abstract images to get inspired. Also, symbolically, this idea matches the coarse-to-fine process in computer vision.

StyleGAN 2: generate ambiguous sketches

I decided to train a StyleGAN2 model for concept art, a domain that requires heavy creativity. As a result, I came up with the idea of asking the model to generate abstract graphics. Working from these low-level sketches as a foundation, concept artists saved a substantial amount of time. I believe that these abstract graphics can, in some way, produce an emergent phenomenon for concept art. In this project, I used the implementation from the paper Training Generative Adversarial Networks with Limited Data. The method used in the paper:

The NVIDIA research team considered a pipeline of 18 transformations that were grouped into 6 categories: pixel blitting (x-flips, 90° rotations, integer translation), more general geometric transformations, color transforms, image-space filtering, additive noise, and cutout.

During training, each image shown to the discriminator used a pre-defined set of transformations in a fixed order. The strength of augmentations was controlled by the scalar p ∈ [0, 1], so that each transformation was applied with probability p or skipped with probability 1 − p. We always used the same value of p for all transformations. The randomization was done separately for each augmentation and for each image in a minibatch. The generator was guided to produce only clean images as long as p remains below the practical safety limit.
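A toy sketch of that "each transformation with probability p" idea follows; it is not NVIDIA's ADA implementation, and a few torch/torchvision operations stand in for the full 18-transformation pipeline:

import torch
import torchvision.transforms.functional as TF

def ada_augment(img: torch.Tensor, p: float) -> torch.Tensor:
    # Apply a fixed sequence of augmentations, each with probability p,
    # randomized independently per image.
    if torch.rand(()) < p:                                  # x-flip
        img = TF.hflip(img)
    if torch.rand(()) < p:                                  # 90-degree rotation
        img = torch.rot90(img, k=int(torch.randint(1, 4, ())), dims=(-2, -1))
    if torch.rand(()) < p:                                  # integer translation
        dy, dx = (int(torch.randint(-8, 9, ())) for _ in range(2))
        img = torch.roll(img, shifts=(dy, dx), dims=(-2, -1))
    if torch.rand(()) < p:                                  # additive noise
        img = img + 0.05 * torch.randn_like(img)
    return img

batch = torch.rand(4, 3, 256, 256)                          # stand-in minibatch in [0, 1]
augmented = torch.stack([ada_augment(x, p=0.7) for x in batch])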


Experiment Results

Now, back to ambiguity. We use Fréchet Inception Distance (FID) to measure how well the GAN works (the lower the score, the closer the generated images are to the real distribution). However, in the case of MonsterGAN, a low FID score doesn’t always mean “better” results. It turns out that, although the lower-FID model did provide more texture “details” on the creatures, it actually lost diversity in shapes and forms (results shown below).


Results for different parameters of StyleGAN2-ADA

p = 0.7, with an FID50K score of 29.65, has more diversity in creature forms
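For reference, one common way to compute FID in PyTorch is via torchmetrics (an assumed dependency here, along with the torch-fidelity extra); the FID50K numbers above come from the StyleGAN2-ADA tooling, so this snippet is only a hedged illustration of the metric itself:

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Stand-in uint8 batches of shape (N, 3, H, W); in practice these would be
# real training images and generator samples.
real_images = torch.randint(0, 255, (50, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (50, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute():.2f}")   # lower means closer to the real distribution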


MonsterGAN: Designing

After we have a trained model, an artist can browse the forms library and choose the most suitable shape for their requirements. Instead of starting from scratch, which is usually the most time-consuming part, artists can pick several images that they find interesting. Since the inputs of the GAN are noise vectors, this gives us infinite concepts.

Let’s say we decide this stony monster with big claws matches our direction. We can then use the input vector of this image as the center of our starting point in latent space. By lowering the truncation psi and the sampling distance, we can achieve detail variation under a similar overall shape.

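A hedged sketch of this "sample around a chosen point with truncation" idea is below; mapping, synthesis, and w_avg are placeholders for a pretrained StyleGAN2 generator's mapping network, synthesis network, and tracked average W, and the radius and psi values are arbitrary:

import numpy as np

def sample_variations(z_center, mapping, synthesis, w_avg,
                      n=8, radius=0.3, psi=0.5, seed=0):
    # Sample latents near the chosen seed vector and pull the resulting W codes
    # toward the average W with truncation psi, so details vary while the
    # overall shape stays close to the selected monster.
    rng = np.random.default_rng(seed)
    images = []
    for _ in range(n):
        z = z_center + radius * rng.standard_normal(z_center.shape)
        w = mapping(z)                       # Z -> W
        w = w_avg + psi * (w - w_avg)        # truncation trick
        images.append(synthesis(w))
    return images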

MonsterGAN: Latent space exploration

It could also be the case that we want to merge multiple directions, and this is where latent space manipulation comes in. Traditionally, if we want to change a feature of an image, we tweak the Z latent space. In terms of combining different shapes, the images below are some results of Z latent space manipulation.

As can be seen, manipulating the Z latent space is a rough way to control features, since the mapping from Z to the feature vector may entangle the distributions of features. Even if we get an acceptable result, the art direction is essentially uncontrollable. Research related to this subject is discussed in the paper Analyzing and Improving the Image Quality of StyleGAN; hence the StyleGAN family uses a mapping network of 8 fully connected layers to encode W from Z.

In the implementation of style mixing, there are 18 style layers. We apply the source style to our target input over a range of these layers, from 1 up to all 18. I ran some experiments extracting the creature's features in the W latent space, and here is what I found (a minimal sketch follows the list):

1. The result of using only 1 layer is subtle.
2. Mixing 3–5 layers works best.
3. Using all 18 layers makes the result essentially identical to the style source.
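The sketch below shows the style-mixing mechanism in W+; G_synthesis is a placeholder for a pretrained generator's synthesis network operating on W+ codes of shape (1, 18, 512), and the layer range is just the 3–5 setting mentioned above:

import numpy as np

def mix_styles(w_target, w_style, layers=range(3, 6)):
    # Replace the chosen style layers of the target W+ code with the source's.
    w_mixed = np.copy(w_target)              # shape (1, 18, 512)
    for i in layers:
        w_mixed[:, i, :] = w_style[:, i, :]
    return w_mixed

# e.g. image = G_synthesis(mix_styles(w_creature, w_texture, layers=range(3, 6)))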

You can find more details in this article

MonsterGAN: Human Refinement

Once satisfied with the model's results, concept artists can jump in and start working on the refinement. Now, here's the beauty of ambiguity. Since every person perceives the same abstract sketch with a different interpretation, it gives artists more flexibility to leverage their creativity. As can be seen, many "errors" were transformed into new designs.

In this project, I collaborated with concept artist Steve Wu, a senior concept designer specializing in creature design with more than six years of experience in the film industry. The works shown in this article are credited to him. Check out Steve's amazing work.

Steve Wu

“During my ten years of design career,
I realized that we had been constrained by the pattern when creating art. When I first heard about using AI as a tool to create art, I hesitated. However, after experiencing it, I realized that it expanded my imagination and saved me lots of time to start from scratch. Besides, some of the synthesized images were almost impossible to come up with the human mind. Overall, MonsterGAN, though not perfect, indeed augments human’s imagination!”

In this design, the abstract visual cues inspired Steve in different ways. First, we can see that the textures on the creature inspired Steve to create the teeth, hair on the head, wings, and extension of the abdomen. Furthermore, Steve decided to remove the block in the lower section of the legs.


This is a good example of how ambiguity inspires artists. The left side of the hog was originally meaningless graphics generated by the model. Surprisingly, Steve managed to turn it into the snout of the hog. The shape of the hog's head was also influenced by the texture of the original image, which became the highlight of this design.


Again, this is an interesting example of how an artist transformed flaws in the result into art. There were two fragments that were supposed to be removed; however, Steve turned them into two sparrows standing on the creature's horn.

Final Artworks


Related GAN-based tool for compositing: GauGAN

After finishing the creature designs, the artists can start working on the background and merge the creature into the scene, providing a look and feel for the concept. Again, we can also develop models to assist humans in each step (e.g., texture creation, lighting, color grading, scene creation). Since Nvidia has already built GauGAN, a tool for generating scenery images, we will use it directly.


Evaluation

In a nutshell, I believe that the latent space of big data provides a higher dimension of creativity by creating a new medium for people to sculpt their imagination and experience. This is because we can use machine learning to extract information from the enormous datasets collected from mobile devices. In other words, Data Sculpting translates our indescribable subjects or creativity into a latent vector and re-creates the output through vector arithmetic to amplify the creator's creativity. The combination of machines diverging variations and humans converging solutions improves an artist's productivity and makes creativity scalable.

Process of Data Sculpting


Bibliography

Carleton, Tamara & Cockayne, William & Leifer, Larry. (2008). An Exploratory Study about the Role of Ambiguity during Complex Problem Solving.. 8–13.

Sio, Ut Na & Ormerod, Thomas. (2009). Does Incubation Enhance Problem Solving? A Meta-Analytic Review. Psychological bulletin. 135. 94–120. 10.1037/a0014212.

Savic, Milos (2016). “Mathematical Problem-Solving via Wallas’ Four Stages of Creativity: Implications for the Undergraduate Classroom,” The Mathematics Enthusiast: Vol. 13, No. 3, Article 6.

How To Solve It, by George Polya, 2nd ed., Princeton University Press, 1957, ISBN 0–691–08097–6.

J. Rasmussen, “Skills, rules, and knowledge; signals, signs, and symbols, and other distinctions in human performance models,” in IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 3, pp. 257–266, May-June 1983, doi: 10.1109/TSMC.1983.6313160.

Tero Karras, Samuli Laine, Timo Aila. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks

Karras, Tero & Aittala, Miika & Hellsten, Janne & Laine, Samuli & Lehtinen, Jaakko & Aila, Timo. (2020). Training Generative Adversarial Networks with Limited Data.

Cummings, Mary. (2017). Informing Autonomous System Design Through the Lens of Skill-, Rule-, and Knowledge-Based Behaviors. Journal of Cognitive Engineering and Decision Making. 12. 155534341773646. 10.1177/1555343417736461.

R. Parasuraman, T. B. Sheridan and C. D. Wickens, “A model for types and levels of human interaction with automation,” in IEEE Transactions on Systems, Man, and Cybernetics — Part A: Systems and Humans, vol. 30, no. 3, pp. 286–297, May 2000, doi: 10.1109/3468.844354.

kai huang
SkimGPT

https://github.com/iamkaikai/SkimGPT-client

https://github.com/iamkaikai/SkimGPT-api

kai huang
Vision

https://www.behance.net/gallery/75550373/Datascapes

Opportunities cannot be forced; sometimes, when one finally does appear, we end up turning it down ourselves. The road should grow wider as you walk it; a person should not grow narrower as they live.

kai huang
Malaysia 2024

May our journey together be as rich and colorful as the tapestry of our beloved homeland

kai huang
Aim!

The wavering muzzle tried to line up with the target ahead. The soldier swallowed and tried to steady the rifle with trembling hands. Damn it, thirsty at a time like this, and the water had already been drunk dry on the way here.

"Lock onto the target ahead. Lock onto the enemy!" came the order from behind.

Son of a bitch, not even a shadow in sight. The soldier reported back: "Target spotted, target locked!"

"Report the enemy's coordinates!" an order barked from the side. A sergeant major pressed up against the soldier, breathing onto his cheek.

It looked like 43, or maybe 41. Whatever, just pick one.

"42!" the soldier shouted, his voice trembling and cracking as it rose.

"Fire!" The bullet shot out of the muzzle at once; the scalding barrel knocked the soldier's whole body back, and the bullet flew like a mad thing.

"Please hit," the soldier prayed silently.

"Report complexity!" came the demand from behind.

"Time: O(N). Space: O(1)," the soldier answered, his voice drifting in the wind.

"Missed!" came the verdict from the rear. The sergeant at his side kicked the soldier and spat in disgust. The force of the kick lifted the soldier and his rifle together, and the gun nearly flew out of his hands. The soldier swallowed, gripped the rifle again, and pulled his helmet down low.

"Fire again!" The words had barely landed when the sound of the bullet roared together with the wind.

In the silent air, a bolt of lightning shot from the soldier's hands; thunder mixed with birdsong as a flock of crows passed overhead.

"Hit!" the machine in the rear reported, the AI's voice efficient and devoid of any emotion.

The soldier rejoiced inwardly, thinking it was all over.

Just as he was about to stand up, the machine behind him let out a distorted noise.

"Firing correction. Time complexity: O(log N)," came the next order from behind.

Goddamn it, there was nothing like this in training. The soldier's mind went blank; he tried to recall every drill he had ever been through as fast as he could, swallowed, and scratched at his helmet, hoping to buy some time.

The soldier adjusted the rifle once more and tried to fire according to the new complexity, but when he looked closely, the scope had vanished.

Bewildered, the soldier heard a clap of thunder explode overhead. He tried to ask his officer for help, but all he could see was the officer's mouth moving furiously, spit flying, with no sound at all, as if someone had hit the mute button.

As the seconds ticked by, the soldier fiddled with the weapon, trying to bring up whatever the machine had demanded.

By now the noise from the rear had vanished, the sergeant major at his side had stopped sighing, and the scope had reappeared. The soldier set the rifle down and put his helmet back on properly, his mind full of endless questions. He did not understand.

At that moment, a huge bang came from across the water where the target stood, and the screen went dark.

A line of text appeared on the screen: "No signal".

The player stood up from his chair in anger, slapped the monitor, put his headset back on, and selected "Restart" from the menu.

kai huang

>+++++++++[<++++++++>-]<.>++++++[<+++++>-]<-.+++++++..+++.>>
+++++++[<++++++>-]<++.------------.<++++++++.--------.+++.------.--------.
>+.>++++++++++.

"Hello World!"


Brainf*ck has only eight instructions: > < + - . , [ and ]. They work as follows:

The variable "p" represents a pointer where p = 0 at the start of every program, and a[p] represents the byte at the pointer. Only ASCII characters represented by the byte the pointer points to are recognized. You are given an array of 30000 bytes and a pointer that can be moved within the array. You can increment or decrement, by one, the byte at the pointer, as well as input or output any byte at the pointer. When the pointer equals zero, the program ends.
    Obviously, this is an imperative language. The information stored at various memory locations are changed with a step-by-step execution of the statements. In fact, Brainf*ck closely resembles an assembly language with its direct manipulation of memory contents.

https://www2.gvsu.edu/miljours/bf.html
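A minimal Brainf*ck interpreter in Python (my own sketch, not code from the page above) that can run the program at the top of this post:

import sys

def run_bf(code: str, tape_size: int = 30000) -> None:
    tape, p, pc = [0] * tape_size, 0, 0
    # Pre-compute matching bracket positions for the two loop instructions.
    jump, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i
    while pc < len(code):
        c = code[pc]
        if c == ">":
            p += 1
        elif c == "<":
            p -= 1
        elif c == "+":
            tape[p] = (tape[p] + 1) % 256
        elif c == "-":
            tape[p] = (tape[p] - 1) % 256
        elif c == ".":
            sys.stdout.write(chr(tape[p]))
        elif c == ",":
            tape[p] = ord(sys.stdin.read(1) or "\0")
        elif c == "[" and tape[p] == 0:
            pc = jump[pc]       # skip the loop body
        elif c == "]" and tape[p] != 0:
            pc = jump[pc]       # jump back to the matching "["
        pc += 1

run_bf(">+++++++++[<++++++++>-]<.>++++++[<+++++>-]<-.+++++++..+++.>>"
       "+++++++[<++++++>-]<++.------------.<++++++++.--------.+++.------.--------."
       ">+.>++++++++++.")       # prints "Hello, world!" and a newline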


kai huang
Conviction (信念)

I came across an interesting take: if, from a purely self-interested standpoint, you choose not to marry or have children, then mapped onto Maslow's pyramid, reaching the Esteem level is actually enough, because everything below it, the needs at the personal level, can be satisfied without ever going to the top. In other words, if you set aside social and cultural constraints, having children or getting married is in some sense something that transcends the individual, and only the very top of the pyramid accounts for it. What can make a person give up their own interests and choose devotion and sacrifice for a partner or for the next generation? As far as I can tell, it comes back to conviction and cultural values.

kai huang
Conclusion (結論)

Sometimes two people arrive at the same result through different lines of reasoning. Both are sure they are right, and in the end it turns out they were both wrong.

kai huang
Discreet Music

"Discreet Music" is an ambient music album by the English musician and producer Brian Eno, released in 1975. The album consists of two long-form compositions, both of which are built around tape loops and other electronic manipulations.

The title track, "Discreet Music," is a 30-minute piece that was created using a series of overlapping tape loops of different lengths and durations. The loops were then played through a graphic equalizer, which allowed Eno to manipulate the tonal qualities of the sound. The resulting piece is a slow, meditative work that is characterized by its gentle, shimmering textures and slowly-evolving harmonies.

The second track on the album, "Three Variations on the Canon in D Major by Johann Pachelbel," is a series of variations on a well-known piece of baroque music. Eno recorded a performance of the canon on a synthesizer, and then used tape loops and other electronic manipulations to create a series of variations that gradually drift away from the original melody.

"Discreet Music" is widely regarded as one of the pioneering works of ambient music, a genre that Eno is credited with coining. The album's use of tape loops, slow-moving textures, and electronic manipulations created a new kind of music that was designed to be listened to as an atmospheric backdrop, rather than as a foregrounded musical statement.

kai huang
NYC 2023

1. A couple taking Christmas wedding photos below the NBC building; I had the feeling they had met at this very spot before.

2. A year on, the media no longer covers the Russia-Ukraine war so heavily, but some people still remember.

3. We went to a restaurant where the server smugly told us every table was booked and that we could wait on-site for bar seats, yet half the tables looked empty, so I pulled out my phone and booked a table online on the spot. The server saw it, swallowed his pride, and gave us a table with a ninety-minute limit. The people who had arrived before us were still waiting there, reflected in the mirror.

4. The streets are too cold in winter, so many homeless people shelter on the subway, an offline mobile hotel of sorts. I wonder if that man's neck is doing okay?

5. No idea what kind of crisis it was about; the candidate's name sounded very American.

6. New York's streets really are filthy; trash is just dumped on the street to wait for pickup. A local said this is normal, that things are left out so homeless people can pick through the food and furniture, but I think that's nonsense. The city's sanitation is simply bad, an ancient bug nobody can fix.

7. Since New York legalized marijuana, the smell is everywhere, more common than cigarette smoke. Maybe in the future, if you get mugged, you can just offer the robber a joint so everyone can relax together and bring the city's crime rate down.

8. An ATM where you can buy BTC. But who buys coins on the street? We want to cash coins out. Still, the street vibe suits the spirit of crypto.

9. We didn't make it to the Times Square countdown; to get a good spot you have to arrive by around eleven in the morning and you can't leave or use the bathroom, so holding it in that long really does make it feel like a whole year. Still, two pieces of confetti were our contribution: they were the notes that, a week earlier, people on the street had been invited to write their New Year's wishes on. (They ended up trampled underfoot.)

kai huang